Quantization Undoes Alignment in Compressed LLMs

Source: arXiv Machine Learning: Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels

Level: research

Large language models are often compressed after training through quantization, which lowers inference cost and memory use for cloud and edge deployment. However, the effect of this compression on alignment is not well understood. Most studies compare a full-precision model against a single quantized version, rely on broad bias metrics, and test only one model family, which makes it hard to tell gradual degradation from sudden safety failures.

Researchers tested three instruction-tuned models (Qwen2.5-7B, Mistral-7B, and Phi-3.5-mini) at five precision levels from BF16 down to 3-bit. They used 12,148 items from the BBQ bias benchmark across five random seeds, producing 911,100 inference records (12,148 items × 3 models × 5 precision levels × 5 seeds). At 3-bit quantization, 6 to 21 percent of previously unbiased items showed new stereotypical behavior. The effect followed a clear dose-response pattern, confirmed by logistic regression.
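
The record count and the headline metric can be made concrete with a short sketch. The code below checks the arithmetic behind the 911,100 records and computes an emergence rate on toy labels. The function names, the item-level definition of "emerged" (unbiased before quantization, stereotypical after), and the toy data are assumptions for illustration, not the study's code or results.

```python
from math import prod

# Study dimensions as reported in the summary above.
N_ITEMS = 12_148      # BBQ items
N_MODELS = 3          # instruction-tuned models
N_PRECISIONS = 5      # precision levels, BF16 down to 3-bit
N_SEEDS = 5           # random seeds


def total_records(*dims: int) -> int:
    """Number of inference records in a full factorial design."""
    return prod(dims)


def emergence_rate(baseline: dict, quantized: dict) -> float:
    """Share of baseline-unbiased items that turn biased after quantization.

    Both arguments map an item id to True (stereotypical answer) or False.
    This item-level definition is an assumption made for illustration.
    """
    unbiased = [item for item, biased in baseline.items() if not biased]
    if not unbiased:
        return 0.0
    emerged = sum(1 for item in unbiased if quantized[item])
    return emerged / len(unbiased)


# Toy labels (hypothetical, not data from the study).
baseline = {1: False, 2: False, 3: False, 4: False, 5: True}
quantized = {1: False, 2: True, 3: False, 4: False, 5: True}

records = total_records(N_ITEMS, N_MODELS, N_PRECISIONS, N_SEEDS)
rate = emergence_rate(baseline, quantized)
print(records)  # 911100
print(rate)     # 0.25
```

The denominator counts only items that were unbiased at baseline, so the rate describes newly introduced bias rather than overall bias.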

The findings highlight a trade-off between model compression and alignment. As quantization becomes more aggressive, the risk of introducing harmful biases increases. This is important for practitioners deploying quantized models in sensitive applications, where maintaining fairness and safety is critical. The study suggests that current evaluation methods may miss these gradual shifts, and more fine-grained testing is needed when compressing models for real-world use.
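
One way to see why a single aggregate score can miss such shifts is to compare it with item-level transitions. In the minimal sketch below, the aggregate bias rate is identical before and after quantization, yet one item has become newly biased. The function names and toy labels are hypothetical and not taken from the study.

```python
from collections import Counter

# Names for each (before, after) pair of bias labels.
TRANSITIONS = {
    (False, False): "stayed_unbiased",
    (False, True): "emerged",
    (True, False): "resolved",
    (True, True): "stayed_biased",
}


def aggregate_bias_rate(labels: dict) -> float:
    """Fraction of items with a stereotypical answer (True)."""
    return sum(labels.values()) / len(labels)


def transition_counts(before: dict, after: dict) -> Counter:
    """Count how each item's bias label changed between two model versions."""
    return Counter(TRANSITIONS[(before[item], after[item])] for item in before)


# Toy labels (hypothetical): item id -> True if the answer was stereotypical.
full_precision = {"a": True, "b": False, "c": False, "d": True}
low_precision = {"a": False, "b": True, "c": False, "d": True}

rate_before = aggregate_bias_rate(full_precision)
rate_after = aggregate_bias_rate(low_precision)
changes = transition_counts(full_precision, low_precision)

print(rate_before, rate_after)  # 0.5 0.5
print(changes["emerged"])       # 1
```

Because one flip in each direction cancels out in the average, only the per-item view reveals that item "b" became biased.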

Why it matters: Deploying quantized models without checking for bias can lead to unfair or harmful outputs, especially in sensitive applications.
