optimizing vlm reward models with few demonstrations

source: arxiv machine learning: from demonstrations to rewards: test-time prompt optimization for vlm reward models

level: research

reinforcement learning needs accurate reward functions, but in real-world tasks like robotics, hand-crafted rewards are often unavailable. recent work uses pre-trained vision-language models (vlms) as zero-shot reward models, yet without careful prompt design they produce suboptimal rewards. false positives, where the model rewards a state that has not actually achieved the task, can severely hurt downstream policy learning. in robotics, a small set of expert demonstrations is commonly collected to bootstrap policy learning, which makes it possible to optimize the reward model before training the policy.

the proposed method, demo2reward, adapts the language instruction of a vlm reward model at test time using only a few demonstrations, typically 3 to 10 trajectories. it aims to cut false positives while keeping true positives intact. the approach needs no extra model training: only the prompt changes, which keeps it lightweight and practical. by refining the prompt based on expert examples, the reward model becomes better aligned with the desired behavior, which leads to better policy learning.
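
to make the idea above concrete, here is a minimal sketch of prompt selection from demonstrations. it is not the paper's algorithm, which this summary does not specify. it assumes a fixed list of candidate prompts, assumes that the final frame of each demonstration is a positive example and earlier frames are negatives, and uses a toy word-overlap scorer in place of a real vlm. all names (`toy_scorer`, `rates`, `select_prompt`) are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# Toy stand-in: a "frame" is a text description instead of an image.
Frame = str
Scorer = Callable[[str, Frame], float]


def toy_scorer(prompt: str, frame: Frame) -> float:
    """Hypothetical scorer: fraction of prompt words found in the frame text.

    A real system would query a VLM here; this only makes the sketch runnable.
    """
    words = prompt.split()
    frame_words = set(frame.split())
    return sum(w in frame_words for w in words) / len(words)


def rates(prompt: str, scorer: Scorer, positives: Sequence[Frame],
          negatives: Sequence[Frame], threshold: float) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate) for one prompt."""
    tp = sum(scorer(prompt, f) >= threshold for f in positives)
    fp = sum(scorer(prompt, f) >= threshold for f in negatives)
    return tp / len(positives), fp / len(negatives)


def select_prompt(candidates: Sequence[str], scorer: Scorer,
                  positives: Sequence[Frame], negatives: Sequence[Frame],
                  threshold: float = 1.0, min_tpr: float = 1.0) -> Optional[str]:
    """Pick the candidate with the lowest false positive rate among those
    that keep the true positive rate at or above min_tpr."""
    best: Optional[Tuple[str, float]] = None
    for prompt in candidates:
        tpr, fpr = rates(prompt, scorer, positives, negatives, threshold)
        if tpr < min_tpr:
            continue  # would lose true positives, so reject
        if best is None or fpr < best[1]:
            best = (prompt, fpr)
    return best[0] if best else None


# Assumed labeling: last demonstration frame = success, earlier frames = not yet.
positives = ["drawer fully open gripper retracted"]
negatives = ["drawer closed gripper approaching",
             "drawer half open gripper pulling"]
candidates = ["drawer", "drawer open", "drawer fully open"]

best = select_prompt(candidates, toy_scorer, positives, negatives)
print(best)
```

in this toy setup the vaguest prompt fires on every frame, while the most specific one fires only on the success frame, so it is selected. the true-positive constraint (`min_tpr`) reflects the stated goal of cutting false positives without losing true positives.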

experiments show that demo2reward improves reward accuracy and downstream task performance compared to standard zero-shot prompting. the technique is especially useful when reward engineering is hard or impossible. it reuses demonstration data that is often already available in robotics pipelines. this work highlights how a small amount of data and careful prompt optimization can replace costly reward design, making reinforcement learning more accessible in real-world settings.

why it matters: it enables better reinforcement learning in robotics by fixing reward model errors with minimal data, avoiding manual reward engineering.
