source: arxiv machine learning: the constraint tax: measuring validity-correctness tradeoffs in structured outputs for small language models

level: research

production systems often need structured outputs such as json objects or tool-call schemas. the common assumption is that hard output constraints improve reliability without changing the answer. this paper shows that the assumption is unsafe for small language models under 3 billion parameters. the authors introduce the constraint tax, a measure of how much answer accuracy drops when structured-output constraints are applied, holding the model, task, and problem instances fixed.
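a minimal sketch of how such a paired measurement could be computed. the summary does not give the paper's exact formula, so this assumes the tax is simply unconstrained accuracy minus constrained accuracy over the same instances. the names `PairedResult`, `accuracy`, and `constraint_tax` are hypothetical, and the four sample results are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PairedResult:
    """One problem instance scored under both decoding modes (hypothetical type)."""
    instance_id: str
    correct_unconstrained: bool
    correct_constrained: bool


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; raises on an empty input."""
    values = list(flags)
    if not values:
        raise ValueError("need at least one result")
    return sum(values) / len(values)


def constraint_tax(results: List[PairedResult]) -> float:
    """Assumed definition: accuracy without constraints minus accuracy with them.

    A positive value means hard constraints cost answer accuracy.
    """
    acc_free = accuracy(r.correct_unconstrained for r in results)
    acc_hard = accuracy(r.correct_constrained for r in results)
    return acc_free - acc_hard


# Illustrative data only, not results from the paper.
results = [
    PairedResult("q1", True, True),
    PairedResult("q2", True, False),  # answer flipped to wrong under constraints
    PairedResult("q3", False, False),
    PairedResult("q4", True, True),
]

tax = constraint_tax(results)
print(f"constraint tax: {tax:.2f}")
```

pairing by instance matters here: because both accuracies are computed over the same problems, a nonzero difference can be attributed to the decoding mode and not to a different sample of questions.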

the study ran 15,000 generations on commodity gpus with models including qwen2.5-0.5b, qwen2.5-1.5b, and smollm2-1.7b. without constraints, schema validity was 61.5 percent. hard schema decoding raised validity to 100 percent, but answer correctness fell. the constraint tax quantifies that loss: forcing valid structure can degrade task performance. the effect is consistent across the small models and tasks studied.
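to make the validity side of the tradeoff concrete, here is a rough sketch of how a schema-validity rate could be measured on unconstrained outputs. the paper's actual schemas and validator are not described in this summary, so the two-field schema in `REQUIRED_FIELDS`, the helper names, and the sample outputs are all assumptions made for illustration.

```python
import json
from typing import List

# Hypothetical schema: field name -> required Python type after JSON parsing.
REQUIRED_FIELDS = {"answer": str, "confidence": float}


def is_schema_valid(raw: str) -> bool:
    """True if raw parses as a JSON object with every required field and type."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    return all(
        key in obj and isinstance(obj[key], expected_type)
        for key, expected_type in REQUIRED_FIELDS.items()
    )


def validity_rate(outputs: List[str]) -> float:
    """Fraction of outputs that pass the schema check."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(is_schema_valid(o) for o in outputs) / len(outputs)


# Illustrative outputs only, not data from the paper.
outputs = [
    '{"answer": "42", "confidence": 0.9}',  # valid
    '{"answer": "42"}',                     # missing required field
    'The answer is 42.',                    # not JSON at all
    '{"answer": "7", "confidence": 0.5}',   # valid
]

rate = validity_rate(outputs)
print(f"schema validity: {rate:.1%}")
```

under hard schema decoding every output passes this kind of check by construction, which is why validity reaches 100 percent; the check says nothing about whether the value in the answer field is right.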

the findings matter for on-device and low-cost deployments where small models are chosen for privacy, latency, or hardware limits. engineers cannot assume that adding output constraints is free. the constraint tax provides a metric to evaluate the tradeoff between getting valid formats and keeping answers correct. this helps developers decide when to use structured decoding and when to accept some format errors to preserve answer quality.

why it matters: it gives ai engineers a concrete metric to balance output format validity against answer accuracy when deploying small language models in production.

