source: arxiv machine learning: reducing credit assignment variance via counterfactual reasoning paths

level: research

reinforcement learning for multi-step reasoning with large language models often relies on a single final reward that is spread evenly across every step. this coarse credit assignment produces high gradient variance, unstable training, and many uninformative updates, which limits how far the model can improve. the paper introduces a counterfactual comparison framework that samples several reasoning paths for the same input. by comparing these paths, it builds an implicit advantage estimator for each step, turning a single end-of-trajectory reward into step-sensitive learning signals.
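
to make the baseline problem concrete, here is a minimal sketch of the even spreading described above. it is an illustration, not code from the paper: the function name `uniform_step_credit`, the group-mean baseline, and the toy inputs are all assumptions.

```python
from statistics import mean


def uniform_step_credit(step_counts, final_rewards):
    """Give every step of a trajectory the same outcome-level advantage."""
    # assumption: the baseline is the mean final reward of the sampled group
    baseline = mean(final_rewards)
    credits = []
    for n_steps, reward in zip(step_counts, final_rewards):
        advantage = reward - baseline
        # every step receives an identical signal, whether it helped or hurt
        credits.append([advantage] * n_steps)
    return credits


# two hypothetical sampled paths: a 3-step success and a 4-step failure
credits = uniform_step_credit([3, 4], [1.0, 0.0])
print(credits)
```

in this toy case every step of the successful path gets the same positive credit and every step of the failed path the same negative credit, so a correct early step inside the failed path is penalized as much as the step that caused the failure.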

the proposed algorithm, implicit behavior policy optimization (ibpo), uses these implicit process-level advantages to guide policy updates. instead of requiring explicit step-by-step rewards, it approximates how much each action contributed by comparing the outcomes of alternative trajectories. this lowers the variance of credit assignment and makes training more stable. in experiments on mathematical reasoning tasks, ibpo raises performance ceilings and avoids the collapse seen with standard methods.
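
one way to picture a step-level advantage derived only from final rewards is a prefix-value estimate: score each partial path by the mean reward of all sampled trajectories that share it, then credit a step with the change in that score. this is a swapped-in monte carlo sketch under stated assumptions, not the estimator defined in the paper; `prefix_values`, `step_advantages`, and the step labels are hypothetical.

```python
from collections import defaultdict


def prefix_values(trajectories, rewards):
    """Mean final reward of all sampled trajectories sharing each prefix."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps, reward in zip(trajectories, rewards):
        for t in range(len(steps) + 1):
            prefix = tuple(steps[:t])  # t = 0 is the empty prefix (the prompt)
            totals[prefix] += reward
            counts[prefix] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in totals}


def step_advantages(steps, values):
    """Credit each step with the change in estimated value it causes."""
    return [
        values[tuple(steps[: t + 1])] - values[tuple(steps[:t])]
        for t in range(len(steps))
    ]


# four hypothetical reasoning paths for one prompt; only the first succeeds
paths = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "e", "f"],
    ["a", "e", "g"],
]
rewards = [1.0, 0.0, 0.0, 0.0]

values = prefix_values(paths, rewards)
success_adv = step_advantages(paths[0], values)
print(success_adv)
```

by construction the per-step advantages telescope: they sum to the final reward minus the value of the empty prefix, so the single outcome signal is redistributed across steps rather than changed in total.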

the approach works by treating the variation among sampled trajectories as a natural experiment. when one path succeeds and another fails, the steps where they differ get stronger learning signals. this sidesteps the need for costly human annotations or separate reward models for each step. the method is compatible with existing policy gradient algorithms and can be added to llm fine-tuning pipelines with minimal overhead.
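
the pairwise intuition above can be sketched as follows: find where a successful and a failed path first differ, then give the steps from that point on a stronger signal than the shared prefix. the helper names, the `shared_weight` parameter, and its default of 0.1 are assumptions chosen for illustration, not values from the paper.

```python
def first_divergence(path_a, path_b):
    """Index of the first step at which two paths differ."""
    for index, (step_a, step_b) in enumerate(zip(path_a, path_b)):
        if step_a != step_b:
            return index
    # one path is a prefix of the other (or they are identical)
    return min(len(path_a), len(path_b))


def contrastive_step_weights(success, failure, shared_weight=0.1):
    """Signed per-step weights: weak on the shared prefix, strong after the split."""
    split = first_divergence(success, failure)
    success_weights = [
        shared_weight if t < split else 1.0 for t in range(len(success))
    ]
    failure_weights = [
        -shared_weight if t < split else -1.0 for t in range(len(failure))
    ]
    return success_weights, failure_weights


# hypothetical pair of paths that agree on the first step only
success = ["set x = 2", "compute 2x = 4", "answer 4"]
failure = ["set x = 2", "compute 2x = 6", "answer 6"]

split = first_divergence(success, failure)
pos, neg = contrastive_step_weights(success, failure)
print(split, pos, neg)
```

the shared first step contributes little in either direction, since it cannot explain why one path succeeded and the other failed, while the steps after the split carry the full positive or negative weight.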

why it matters: it makes reinforcement learning for llm reasoning more reliable by improving credit assignment, a key bottleneck in training models that solve complex problems step by step.

