what reward learning gets from best-of-n preference data

source: arxiv statistics ml: reward learning from best-of-$n$ preference data: targets, tradeoffs, and design principles

level: research

best-of-n sampling creates pairwise preference data by drawing n candidates from a base distribution and pairing the best candidate with a rejected response. this work examines what bradley-terry reward learning actually extracts from such data. for independent-reference variants, in which the rejected response does not come from the same candidate set as the chosen one, the learned reward targets are explicit functions of n and the base distribution, and they preserve the latent reward ranking. in other words, the model recovers the true ordering of responses even though the absolute reward values are transformed.
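
to make the idea of an explicit target concrete, here is a minimal sketch under assumptions that are not taken from the paper: a small discrete base distribution, distinct latent rewards, a chosen response that is the best of n base draws, and a rejected response that is one independent base draw. under those assumptions, the probability that a is chosen over b, given the unordered pair, is sigmoid(t(a) - t(b)) with t(y) = log(p_n(y) / p(y)), where p_n is the best-of-n distribution. the names `best_of_n_pmf` and `bt_target` and all numbers are hypothetical.

```python
import math

def best_of_n_pmf(base_pmf, rewards, n):
    """Exact best-of-n distribution for a discrete base distribution.

    Assumes distinct latent rewards, so the top candidate is unambiguous.
    """
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    pmf = [0.0] * len(rewards)
    cdf_below = 0.0
    for i in order:
        cdf_at = cdf_below + base_pmf[i]
        # P(best is i) = P(all n draws rank <= i) - P(all n draws rank < i)
        pmf[i] = cdf_at ** n - cdf_below ** n
        cdf_below = cdf_at
    return pmf

def bt_target(base_pmf, rewards, n):
    """Assumed BT target log(p_n(y) / p(y)), defined up to an additive constant."""
    p_n = best_of_n_pmf(base_pmf, rewards, n)
    return [math.log(q / p) for q, p in zip(p_n, base_pmf)]

# illustrative inputs (not from the paper)
base_pmf = [0.1, 0.2, 0.3, 0.4]   # base distribution over 4 responses
rewards = [0.3, -1.0, 2.0, 0.9]   # latent rewards, all distinct

targets = {n: bt_target(base_pmf, rewards, n) for n in (2, 4, 8)}

def ranking(values):
    """Indices sorted from lowest to highest value."""
    return sorted(range(len(values)), key=lambda i: values[i])

for n, t in targets.items():
    same_order = ranking(t) == ranking(rewards)
    spread = max(t) - min(t)
    print(f"n={n}: ranking preserved={same_order}, target spread={spread:.2f}")
```

in this toy setting the target depends only on n and the base distribution, keeps the latent ordering, and spreads further apart as n grows, which matches the qualitative claims above.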

in practical best-vs-random and best-vs-worst setups, chosen and rejected responses come from the same candidate set, so exact bradley-terry representability usually fails. however, as n increases, minimizers over a bounded reward class approach the reference targets. the analysis also connects to known factors like margin and connectivity that influence sample efficiency. these findings clarify how design choices like n and the base distribution shape the learned reward.
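
one simple way to see the difference between coupled and independent designs is to count reversals, meaning pairs where the rejected response has higher latent reward than the chosen one. the simulation below is a sketch under assumptions not taken from the paper: labels are noiseless, latent rewards are uniform on [0, 1], and `reversal_rate` is a hypothetical helper. a bradley-terry model with finite rewards gives every ordering positive probability, so a design with no reversals at all cannot be matched exactly; this is an illustration, not necessarily the paper's argument.

```python
import random

def reversal_rate(n, coupled, trials=20000, seed=0):
    """Fraction of pairs where the rejected response outranks the chosen one.

    coupled=True  -> best-vs-random: rejected is another member of the same set (needs n >= 2)
    coupled=False -> independent reference: rejected is a fresh base draw
    """
    rng = random.Random(seed)
    reversals = 0
    for _ in range(trials):
        # latent rewards of the n candidates (uniform is an illustrative choice)
        candidates = [rng.random() for _ in range(n)]
        best_idx = max(range(n), key=candidates.__getitem__)
        chosen = candidates[best_idx]
        if coupled:
            others = candidates[:best_idx] + candidates[best_idx + 1:]
            rejected = rng.choice(others)
        else:
            rejected = rng.random()
        reversals += rejected > chosen
    return reversals / trials

for n in (2, 4, 16):
    coupled_rate = reversal_rate(n, coupled=True)
    independent_rate = reversal_rate(n, coupled=False)
    print(f"n={n}: coupled={coupled_rate:.3f}, "
          f"independent={independent_rate:.3f}, 1/(n+1)={1 / (n + 1):.3f}")
```

with continuous rewards, an independent draw beats the best of n with probability 1/(n+1), so the independent design's reversal rate also shrinks toward the coupled design's zero as n grows. that is consistent with the two designs becoming closer for large n.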

the results give practitioners concrete guidance for training reward models on best-of-n data. the choice of n and the base distribution determines the target reward function, so these choices can be made deliberately to avoid unintended biases. the work also highlights a limitation: in coupled designs, where chosen and rejected responses share a candidate set, the bradley-terry model is misspecified, though the approximation improves with larger n. this helps explain when and why best-of-n preference data works well in practice.

why it matters: understanding what reward models learn from best-of-n data helps ai practitioners design better preference datasets and avoid hidden biases in alignment pipelines.
