the migration from vllm v0 to v1 as the inference engine for pipelinerl revealed a train-inference mismatch. the initial v1 run showed trainer-side metrics such as clip rate, kl divergence, entropy, and reward diverging from the v0 reference. the core issue was that v1 returned raw logprobs by default, while the trainer expected processed logprobs, i.e. logprobs computed after temperature scaling and filtering. setting logprobs_mode to processed_logprobs fixed the mean offset, but other gaps remained due to runtime defaults and weight update handling.
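the raw-vs-processed gap can be reproduced in isolation: raw logprobs come from log-softmax over the unscaled logits, while processed logprobs apply temperature scaling (and any filtering) before normalization. a minimal numpy sketch with made-up logits, showing the two quantities the trainer and the engine were disagreeing on:

```python
import numpy as np

def log_softmax(x):
    # numerically stable log-softmax
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

logits = np.array([2.0, 1.0, 0.5, -1.0])  # illustrative logits
temperature = 0.7

# raw logprobs: normalization over the unscaled logits
raw = log_softmax(logits)

# processed logprobs: temperature applied first, which is what
# the trainer expects to match its own forward pass
processed = log_softmax(logits / temperature)

# the two differ at every temperature other than 1.0,
# which shows up as a mean offset in trainer-side metrics
gap = np.abs(raw - processed).max()
```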

further fixes aligned v1's runtime behavior with v0. disabling prefix caching and async scheduling removed v1-only optimizations that affected cache lifetime and request handling during online weight updates. the inflight weight update path was adjusted to match v0's approach of blocking execution, loading new weights, and resuming without explicit cache invalidation. these changes reduced persistent lag in weight synchronization, bringing the training curves closer to the reference.
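a config fragment sketching how these runtime settings might be expressed on the vllm cli; the flag names are assumptions and may differ across versions, so verify them against your version's --help before relying on them:

```shell
# flag names are assumptions; check `vllm serve --help` for your version
vllm serve <your-model> \
  --logprobs-mode processed_logprobs \
  --no-enable-prefix-caching
# async scheduling is an opt-in v1 optimization; leaving out
# --async-scheduling keeps it disabled
```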
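the v0-style inflight update described above can be sketched with a hypothetical engine interface; the class and method names (pause, load_weights, resume) are illustrative stand-ins, not vllm api:

```python
class Engine:
    """stand-in for an inference engine; method names are illustrative."""

    def __init__(self):
        self.paused = False
        self.weights_version = 0
        self.log = []

    def pause(self):
        # block execution of in-flight requests
        self.paused = True
        self.log.append("pause")

    def load_weights(self, version):
        # swap in new weights pushed by the trainer
        assert self.paused, "weights must not change under running requests"
        self.weights_version = version
        self.log.append("load")

    def resume(self):
        # resume decoding; note there is no explicit cache invalidation,
        # matching the v0 behavior the migration reverted to
        self.paused = False
        self.log.append("resume")

def inflight_update(engine, version):
    engine.pause()
    engine.load_weights(version)
    engine.resume()

engine = Engine()
inflight_update(engine, 1)
print(engine.log)  # → ['pause', 'load', 'resume']
```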

final parity required matching the numerical precision of the final projection layer. the trainer used an fp32 lm_head, but v1 did not, causing small logit differences that propagated into policy ratios and clipping. switching the head to fp32 aligned the rollout logprobs with the trainer's expectations, and the reward curves then tracked the v0 reference. the broader lesson is that online rl systems depend on consistent logprob computation between rollout and training; backend correctness must be verified before applying objective-side corrections for staleness or asynchrony.


source: hugging face blog: vllm v0 to v1: correctness before corrections in rl