source: arxiv statistics ml: ais: adaptive importance sampling for quantized rl

level: technical

reinforcement learning for large language models often uses low-precision rollouts, such as fp8, to speed up generation and save memory, while the trainer typically runs in bf16. the precision gap means the policy that generates trajectories is no longer exactly the policy being trained, which biases the policy gradient and can make training collapse on reasoning tasks. the mismatch also changes character over time: early in training it adds helpful exploration, exposing the trainer to trajectories it would otherwise miss; later, as the policy sharpens, the same perturbation becomes a harmful bias that destabilizes learning.
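to make the mismatch concrete, here is a minimal sketch, not the paper's code, of the standard per-token importance-sampling correction for a rollout/trainer policy gap; all function and variable names are ours, and fp8/bf16 enter only through the two log-probability tensors:

import torch

def is_corrected_pg_loss(logp_trainer, logp_rollout, advantages, log_w_clip=10.0):
    """policy-gradient surrogate with importance-sampling correction.

    logp_trainer: log pi_bf16(a_t | s_t), trainer policy (differentiable)
    logp_rollout: log pi_fp8(a_t | s_t), policy that generated the tokens
    advantages:   per-token advantage estimates
    """
    # w_t = pi_trainer / pi_rollout reweights rollout tokens so the
    # gradient is unbiased for the trainer policy; dropping w_t, i.e.
    # pretending the trainer generated the tokens, is the bias source.
    log_w = (logp_trainer - logp_rollout).detach()   # stop-grad on the weight
    w = torch.exp(log_w.clamp(max=log_w_clip))       # clip for stability
    return -(w * advantages * logp_trainer).mean()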

the proposed method, adaptive importance sampling, adjusts the strength of its correction per batch using three real-time signals: the reliability of the importance weights, the divergence between the rollout and trainer policies, and a third diagnostic. it acts as a dynamic filter, strengthening or weakening the correction according to how much the mismatch is hurting training at that moment. this removes the need for manual tuning, lets the correction track the policy as it evolves, and is designed to preserve the early exploration benefit while removing the later bias.
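the paper does not spell out its three signals here, so the sketch below uses effective sample size as a stand-in for weight reliability and an empirical kl estimate as the divergence, and omits the third diagnostic; the gating rule itself is our guess at the shape of such a controller, not the paper's formula:

import torch

def correction_strength(log_w, kl_target=0.05, ess_floor=0.5):
    """map per-batch diagnostics to a correction strength alpha in [0, 1].

    alpha -> 0 leaves rollouts uncorrected (keeps early exploration),
    alpha -> 1 applies the full importance-sampling correction.
    """
    w = torch.exp(log_w)
    # weight reliability: normalized effective sample size of the weights
    # (1.0 = weights all equal, -> 0 = a few tokens dominate)
    ess = w.sum() ** 2 / (w.pow(2).sum() * w.numel())
    # divergence: samples come from the rollout policy, so the mean of
    # -log w is an empirical estimate of kl(rollout || trainer)
    kl = (-log_w).mean()
    # strengthen the correction as the policies drift apart, but back off
    # when the weights are too degenerate to trust
    alpha = torch.clamp(kl / kl_target, 0.0, 1.0)
    if ess < ess_floor:
        alpha = alpha * ess / ess_floor
    return alpha

def gated_pg_loss(logp_trainer, logp_rollout, advantages):
    log_w = (logp_trainer - logp_rollout).detach()
    alpha = correction_strength(log_w)
    # interpolate between no correction (w = 1) and the full weight exp(log_w)
    w = torch.exp(alpha * log_w)
    return -(w * advantages * logp_trainer).mean()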

experiments on reasoning benchmarks show that adaptive importance sampling prevents the training collapses seen with standard quantized rl. it matches or exceeds the performance of full-precision training while keeping the speed and memory gains of low-precision rollouts. the approach is lightweight and does not require changes to the model architecture. it offers a practical way to make quantized rl reliable for large language model fine-tuning.

why it matters: it enables stable and efficient reinforcement learning for large language models using low-precision hardware, reducing cost and energy without sacrificing performance.

