offline reasoning training methods converge to similar weight updates

source: arxiv machine learning: weight-space geometry of offline reasoning training

level: research

researchers trained the same base model on the same math reasoning data with six offline training methods. they compared sft, rft, rift, dft, offline grpo, and dpo by analyzing weight deltas (the change in weights relative to the base model) with cosine similarity, principal angles, linear mode connectivity, and cka. sft, rft, and rift produced nearly collinear weight updates, with cosine similarity of at least 0.97 and a top principal angle of about 7 degrees across 144 modules. these three methods also reached similar gsm8k accuracy, 87 to 88 percent, with no statistically significant difference.
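the two headline metrics can be sketched in a few lines of numpy. this is a minimal illustration, not the paper's code: the function names, the toy matrices, and the choice of the top-k left singular vectors as each update's subspace are assumptions, and the summary does not say which subspace or rank the authors used.

```python
import numpy as np

def cosine_similarity(delta_a, delta_b):
    """Cosine similarity between two weight deltas, each flattened to a vector."""
    a = delta_a.ravel()
    b = delta_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def principal_angles_deg(delta_a, delta_b, k):
    """Principal angles (degrees, ascending) between the top-k column subspaces.

    Assumption: each update's subspace is spanned by its top-k left
    singular vectors; the paper may define the subspace differently.
    """
    ua = np.linalg.svd(delta_a, full_matrices=False)[0][:, :k]
    ub = np.linalg.svd(delta_b, full_matrices=False)[0][:, :k]
    # Singular values of ua^T ub are the cosines of the principal angles.
    cosines = np.linalg.svd(ua.T @ ub, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

# Toy data standing in for one module's weight deltas (hypothetical values).
rng = np.random.default_rng(0)
delta_sft = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))  # rank-8 update
delta_rft = delta_sft + 0.05 * rng.normal(size=(64, 64))         # small perturbation

print(cosine_similarity(delta_sft, delta_rft))
print(principal_angles_deg(delta_sft, delta_rft, k=8))
```

in a per-module analysis like the one described, these functions would be applied to each of the 144 modules separately and the results aggregated.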

dft diverged more in weight direction than any reward-weighted method, even though it used the same training data. offline grpo added a large orthogonal component to the weight update, making it mechanistically distinct. the study used attention-only lora on qwen3-4b with identical math rollouts, so that the loss function was the only variable. the findings suggest that many popular offline reasoning losses may not learn fundamentally different representations.
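one way to make "a large orthogonal component" concrete is to project one update onto a reference update and measure what is left over. the sketch below assumes a simple projection under the frobenius inner product; the function names are hypothetical, and the summary does not state how the paper performs its decomposition.

```python
import numpy as np

def split_relative_to(delta, reference):
    """Split `delta` into a part parallel to `reference` and a remainder.

    Both matrices are treated as flat vectors under the Frobenius inner
    product, so the remainder is orthogonal to `reference`.
    """
    d = delta.ravel()
    r = reference.ravel()
    coef = (d @ r) / (r @ r)
    parallel = coef * reference
    orthogonal = delta - parallel
    return parallel, orthogonal

def orthogonal_fraction(delta, reference):
    """Fraction of the squared norm of `delta` that lies outside `reference`."""
    _, orthogonal = split_relative_to(delta, reference)
    return float(np.linalg.norm(orthogonal) ** 2 / np.linalg.norm(delta) ** 2)

# Hypothetical 2x2 example: the reference points along one coordinate only.
reference = np.array([[1.0, 0.0], [0.0, 0.0]])
delta = np.array([[3.0, 4.0], [0.0, 0.0]])
print(orthogonal_fraction(delta, reference))
```

the orthogonal fraction equals one minus the squared cosine similarity, so a method with a cosine of 0.97 to the reference has only about 6 percent of its squared norm outside the reference direction.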

the results question whether complex loss functions are needed when simpler methods yield nearly identical internal changes. for practitioners, this means sft or rft might be sufficient for distilling reasoning, reducing computational overhead. the work also highlights the value of weight-space analysis beyond accuracy benchmarks to understand what models actually learn.

why it matters: it shows that many offline reasoning training methods produce similar model internals, so simpler approaches may save compute without losing performance.
