Large language models store the key and value vectors of previous tokens in a KV cache, which grows linearly with sequence length and becomes a memory bottleneck during generation. Quantizing the cache to fewer bits reduces memory use, but current methods assign the same bit-width to every attention head, ignoring that some heads matter more than others for output quality. A natural improvement is mixed-precision quantization: more bits for important heads, fewer for less important ones.
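To make the setup concrete, here is a minimal sketch of per-head mixed-precision quantization of a KV cache, using NumPy and a simple min-max uniform quantizer. The function names, tensor shapes, and quantizer are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def quantize_uniform(x: np.ndarray, bits: int) -> np.ndarray:
    """Min-max uniform quantization followed by dequantization."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((x - lo) / scale)
    return (q * scale + lo).astype(x.dtype)

def quantize_kv_per_head(kv: np.ndarray, bits_per_head: list) -> np.ndarray:
    """kv has shape (num_heads, seq_len, head_dim); one bit-width per head."""
    out = np.empty_like(kv)
    for h, bits in enumerate(bits_per_head):
        out[h] = quantize_uniform(kv[h], bits)
    return out

# Example: 4 heads; heads deemed important get more bits.
kv = np.random.randn(4, 128, 64).astype(np.float32)
kv_q = quantize_kv_per_head(kv, bits_per_head=[4, 2, 3, 2])
print(np.mean((kv - kv_q) ** 2, axis=(1, 2)))  # per-head distortion
```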

The authors identify a hidden pitfall in mixed-precision allocation: different quantizer designs have different distortion curves, i.e., the relationship between bit-width and error varies from quantizer to quantizer. If one quantizer's distortion model is applied to another, the allocation order can invert, making performance worse than uniform quantization. They call this failure distortion model mismatch; it arises because the rate at which distortion decays with added bits differs across quantizers, ranging from 3.6 to 5.3 in their study.
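A toy calculation shows how such an inversion can happen. Assuming the common exponential model D(b) = C * 2^(-alpha * b), the decay rates below echo the paper's reported 3.6 to 5.3 range, while the per-head sensitivities C and the current bit-width are made-up numbers:

```python
def marginal_gain(C: float, alpha: float, b: int) -> float:
    """Distortion reduction from granting one more bit at width b."""
    return C * 2.0 ** (-alpha * b) * (1 - 2.0 ** (-alpha))

C = [1.0, 0.9]            # per-head sensitivity (illustrative)
alpha_true = [5.3, 3.6]   # each quantizer's actual decay rate
alpha_wrong = [4.0, 4.0]  # one quantizer's model misapplied to both

b = 3  # both heads currently at 3 bits; which deserves the next bit?
wrong = [marginal_gain(c, a, b) for c, a in zip(C, alpha_wrong)]
true_ = [marginal_gain(c, a, b) for c, a in zip(C, alpha_true)]
print("mismatched model favors head", wrong.index(max(wrong)))  # head 0
print("actual behavior favors head", true_.index(max(true_)))   # head 1
```

Under the mismatched model the slightly more sensitive head 0 looks like the better recipient, but with the true decay rates the next bit is worth roughly 29x more on head 1, so the allocation order inverts.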

To fix this, they propose RateQuant, which fits a separate distortion model for each quantizer and uses rate-distortion theory to allocate bits optimally across heads. This ensures the allocation matches the actual behavior of the quantizer in use, avoiding mismatch. For AI practitioners, RateQuant means more memory-efficient serving of large language models with better quality retention, directly improving deployment scalability and cost.
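Here is a minimal sketch of that idea as described: fit C and alpha per quantizer from a few calibration measurements, then spend a bit budget where the modelled marginal gain is largest. The greedy allocator is one standard way to solve the discrete rate-distortion allocation; the paper's exact procedure may differ, and the calibration numbers below are synthetic.

```python
import numpy as np

def fit_distortion_model(bits, errors):
    """Fit C and alpha from measured (bit-width, distortion) pairs via
    log-linear least squares: log2 D = log2 C - alpha * b."""
    slope, intercept = np.polyfit(bits, np.log2(errors), deg=1)
    return 2.0 ** intercept, -slope  # C, alpha

def allocate_bits(C, alpha, budget, b_min=1, b_max=8):
    """Greedily grant bits to the head with the largest modelled
    marginal distortion reduction until the budget is spent."""
    n = len(C)
    bits = [b_min] * n
    for _ in range(budget - b_min * n):
        gains = [
            C[h] * 2.0 ** (-alpha[h] * bits[h]) * (1 - 2.0 ** (-alpha[h]))
            if bits[h] < b_max else -1.0
            for h in range(n)
        ]
        bits[gains.index(max(gains))] += 1
    return bits

# Calibrate each head's quantizer at a few probe bit-widths
# (synthetic measurements), then allocate a 12-bit budget.
probe_bits = [2, 4, 8]
measured = [[1.2e-3, 9.0e-7, 5.0e-13], [4.0e-3, 3.0e-5, 2.0e-9]]
models = [fit_distortion_model(probe_bits, e) for e in measured]
C = [m[0] for m in models]
alpha = [m[1] for m in models]
print(allocate_bits(C, alpha, budget=12))
```

Because each head's model is fit to its own quantizer's measured behavior, the allocator cannot be misled by a decay rate borrowed from a different quantizer design.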


Source: arXiv Machine Learning: RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory