level: research
offline reinforcement learning usually needs per-step reward signals, but many real datasets only have a single outcome label per trajectory. this paper builds a statistical framework for learning policies from such trajectory-level supervision. the authors study the case where each trajectory provides a scalar label whose expected value equals the cumulative return. they propose opac, a pessimistic actor-critic method that learns a latent reward model and then optimizes a policy using only these coarse labels.
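to make the latent reward idea concrete, here is a minimal sketch of recovering per-step rewards from trajectory-level labels alone. it is not opac itself: it assumes a small tabular problem, uses plain ridge regression, and omits both the pessimism and the actor-critic step. the function name `fit_latent_reward` and all of its parameters are hypothetical, introduced only for illustration.

```python
import numpy as np

def fit_latent_reward(trajs, labels, n_states, n_actions, ridge=1e-3):
    """Ridge-regression estimate of a tabular per-step reward.

    Assumption (not from the paper): the expected trajectory label equals
    the sum of per-step rewards r(s, a) along the trajectory, and r is
    tabular. `trajs` is a list of trajectories, each a list of (s, a) pairs.
    """
    d = n_states * n_actions
    # Row i counts how often each (state, action) pair occurs in trajectory i.
    counts = np.zeros((len(trajs), d))
    for i, traj in enumerate(trajs):
        for s, a in traj:
            counts[i, s * n_actions + a] += 1.0
    # Solve the regularized normal equations for label ~ counts @ r.
    gram = counts.T @ counts + ridge * np.eye(d)
    r_hat = np.linalg.solve(gram, counts.T @ np.asarray(labels, dtype=float))
    return r_hat.reshape(n_states, n_actions)

# Usage on synthetic data: the learner sees only one scalar per trajectory.
rng = np.random.default_rng(0)
true_r = rng.uniform(size=(3, 2))
trajs = [[(int(rng.integers(3)), int(rng.integers(2))) for _ in range(5)]
         for _ in range(500)]
labels = [sum(true_r[s, a] for s, a in t) for t in trajs]
r_hat = fit_latent_reward(trajs, labels, n_states=3, n_actions=2)
```

the regression is only well-posed for (state, action) pairs that the logged trajectories actually visit, which is one way to see why a coverage term shows up in the analysis.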
the main result is a high-probability performance bound of order h^2 sqrt(c_sa(pi_star)/n), where h is the horizon, c_sa is a coverage term, and n is the number of trajectories. a matching lower bound shows this rate is tight, revealing the exact statistical cost of losing per-step rewards. the analysis also extends to preference-based feedback, where only pairwise comparisons between trajectories are available, and the method preserves the same leading horizon dependence.
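the scaling in the bound can be read off with a little arithmetic. the sketch below sets the unspecified constant to 1 and ignores logarithmic factors, both of which are assumptions, so the outputs indicate how the rate scales rather than giving real sample sizes. the function names are hypothetical.

```python
import math

def suboptimality_rate(horizon, coverage, n_traj):
    """Leading-order rate h^2 * sqrt(c_sa / n), with the constant set to 1."""
    return horizon**2 * math.sqrt(coverage / n_traj)

def trajectories_needed(horizon, coverage, epsilon):
    """Invert the rate: n >= h^4 * c_sa / epsilon^2 (constants and logs ignored)."""
    return math.ceil(horizon**4 * coverage / epsilon**2)

# Doubling the horizon multiplies the required number of trajectories by 16.
n_short = trajectories_needed(horizon=10, coverage=4.0, epsilon=0.5)
n_long = trajectories_needed(horizon=20, coverage=4.0, epsilon=0.5)
```

under this reading, the number of trajectories grows with the fourth power of the horizon, linearly with the coverage term, and with the inverse square of the target accuracy.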
the work clarifies when trajectory-level supervision is enough for efficient offline rl. it shows that with proper pessimism and latent reward modeling, one can match the sample complexity of methods that use per-step (process-level) rewards, up to horizon factors. this matters for applications like healthcare or dialogue systems where only final outcomes or human preferences are logged, not dense rewards. the theory gives guidance on how many trajectories are needed and on how to design algorithms for such settings.
why it matters: it shows how to train rl policies from coarse outcome labels, a common situation in real-world logs where per-step rewards are missing.