source: hugging face blog: direct preference optimization beyond chatbots

level: technical

supervised fine-tuning helps models learn a task but does not directly penalize repetition loops. in ocr, these loops can persist even after fine-tuning, with degeneration rates sometimes exceeding 33%. the problem lies in how sft works: it maximizes the likelihood of each correct token given the correct tokens before it, and never explicitly marks a full degenerate output as wrong. this leaves a structural gap: at generation time the model conditions on its own output, so it can still fall into self-reinforcing repetition patterns.
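to make "repetition loop" concrete, here is a minimal sketch of how a degeneration rate could be measured over a batch of outputs. the heuristic (a word n-gram repeated back-to-back several times), the function names, and the default thresholds are all assumptions for illustration; the source does not describe how degeneration was detected.

```python
def has_repetition_loop(text: str, ngram: int = 4, min_repeats: int = 3) -> bool:
    """Return True if some word n-gram repeats back-to-back at least min_repeats times.

    Hypothetical heuristic for illustration only, not the detector used in the source.
    """
    words = text.split()
    # Only start positions that leave room for min_repeats full copies are checked.
    for start in range(len(words) - ngram * min_repeats + 1):
        unit = words[start:start + ngram]
        repeats = 1
        pos = start + ngram
        # Count how many consecutive copies of the unit follow it.
        while words[pos:pos + ngram] == unit:
            repeats += 1
            pos += ngram
        if repeats >= min_repeats:
            return True
    return False


def degeneration_rate(outputs: list[str]) -> float:
    """Fraction of outputs flagged as repetition loops (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(has_repetition_loop(o) for o in outputs) / len(outputs)
```

a real pipeline would likely need something more robust (character-level loops, near-duplicates), but even a check like this is enough to track whether a training stage makes loops more or less frequent.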

direct preference optimization addresses this by training on full output pairs. the dharmaocr team took the sft model's own degenerate outputs and used them as rejected examples, paired with correct transcriptions as chosen examples. this gave dpo a clear signal to push the model away from the specific failure mode. the approach required no human labels, only an automated judge to score outputs. degenerate outputs were kept in the training data on purpose, not filtered out, because they represented exactly what the model needed to unlearn.
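the pairing step and the objective can be sketched as follows. this is an illustration under stated assumptions, not the dharmaocr team's code: `PreferencePair`, `build_pairs`, and the `is_degenerate` callable (a stand-in for the automated judge) are hypothetical names, and `beta=0.1` is just a commonly used default. the loss is the standard dpo formulation, computed from summed sequence log-probabilities under the policy and a frozen reference model.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # e.g. an identifier for the input document image
    chosen: str    # correct transcription
    rejected: str  # degenerate output produced by the SFT model itself


def build_pairs(samples, is_degenerate):
    """Build pairs from (prompt, reference, sft_output) tuples.

    is_degenerate is a hypothetical stand-in for the automated judge.
    """
    pairs = []
    for prompt, reference, sft_output in samples:
        # Degenerate outputs are kept on purpose: they become the rejected side.
        if is_degenerate(sft_output):
            pairs.append(PreferencePair(prompt, reference, sft_output))
    return pairs


def _softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, from summed sequence log-probabilities."""
    chosen_gain = policy_chosen_lp - ref_chosen_lp
    rejected_gain = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return _softplus(-margin)
```

the loss depends on whole-sequence log-probabilities, so the full degenerate output is scored as a unit. it shrinks as the policy raises the chosen transcription and lowers the rejected loop relative to the reference model, which is the sequence-level signal that token-by-token sft lacks.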

across five model families, dpo reduced degeneration in every case, with reductions ranging from 37% to 88%. one model saw degeneration rise after sft, then fall after dpo, showing that sft can increase exposure to failure modes while dpo corrects them. the consistency across different architectures and scales suggests that optimizing over complete preference pairs can target failure modes that token-level training misses. this method may apply to other structured generation tasks where clear failure patterns exist.
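the summary does not say whether the 37% to 88% figures are relative or absolute reductions. assuming they are relative reductions in degeneration rate, the arithmetic would look like the sketch below; the function name is hypothetical and the numbers in the usage note are made up for illustration, not taken from the source.

```python
def relative_reduction(rate_before: float, rate_after: float) -> float:
    """Relative drop in degeneration rate, e.g. 0.8 means an 80% reduction.

    Assumes the reported percentages are relative reductions (not confirmed by the source).
    """
    if rate_before <= 0:
        raise ValueError("rate_before must be positive")
    return (rate_before - rate_after) / rate_before
```

for example, a model whose degeneration rate went from 0.30 after sft to 0.06 after dpo would show a relative reduction of about 0.8, i.e. roughly 80%.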

why it matters: it shows a practical way to reduce specific model failures without human feedback, using the model's own mistakes as training data.

