Large language models struggle with long inputs because key-value (KV) cache memory grows linearly with sequence length. Existing compression methods rely on heuristic rules to decide which tokens to keep or drop, such as fixed budgets or attention-based importance scores. These heuristics often misallocate resources because they ignore the task at hand. The new LKV method instead treats KV cache eviction as a differentiable optimization problem, learning how to compress the cache directly from the task loss.
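To make the baseline concrete, here is a minimal sketch of the kind of heuristic eviction the paper argues against: keep a fixed per-head budget of tokens, ranked by accumulated attention mass (in the spirit of attention-score-based methods, not LKV itself). The function name, shapes, and budget value are illustrative assumptions.

```python
# Hypothetical heuristic KV cache eviction baseline (not LKV):
# keep a fixed per-head budget of tokens ranked by received attention.
import torch

def evict_by_attention(keys, values, attn_weights, budget):
    """keys, values:  [num_heads, seq_len, head_dim]
    attn_weights:  [num_heads, query_len, seq_len] attention probabilities
    budget:        number of tokens kept per head (fixed heuristic)."""
    # Importance score per token = total attention it received from recent queries.
    scores = attn_weights.sum(dim=1)                             # [num_heads, seq_len]
    keep = scores.topk(budget, dim=-1).indices.sort(-1).values   # preserve original order
    idx = keep.unsqueeze(-1).expand(-1, -1, keys.size(-1))
    return keys.gather(1, idx), values.gather(1, idx)

# Example: 8 heads, 1024 cached tokens, keep only 256 per head.
k = torch.randn(8, 1024, 64); v = torch.randn(8, 1024, 64)
a = torch.softmax(torch.randn(8, 32, 1024), dim=-1)
k_small, v_small = evict_by_attention(k, v, a, budget=256)
print(k_small.shape)  # torch.Size([8, 256, 64])
```

The fixed `budget` and attention-sum score are exactly the proxy statistics that can misallocate memory when the task needs something different.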

LKV has two parts: LKV-H learns per-head budgets, and LKV-T selects tokens without computing full attention matrices, avoiding the heavy cost of materializing attention scores while still capturing token importance (see the sketch below). Because it is trained end-to-end, the compression is aligned with the model's final objective rather than with proxy statistics. Experiments on the LongBench and RULER benchmarks show that LKV achieves state-of-the-art performance, retaining more useful information under tight memory constraints.
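The paper does not spell out its architecture here, but the end-to-end idea can be sketched as follows: score tokens per head from the keys alone (no full attention matrix), gate the KV cache with a differentiable mask, and let a learnable per-head budget trade task loss against memory. The module, parameter names, and loss weighting below are assumptions for illustration, not the paper's actual method or API.

```python
# Hypothetical sketch of differentiable KV compression with head-wise budgets.
import torch
import torch.nn as nn

class LearnedKVCompressor(nn.Module):
    def __init__(self, num_heads, head_dim, mem_weight=1e-3):
        super().__init__()
        self.scorer = nn.Linear(head_dim, 1)                       # token importance from keys only
        self.budget_logit = nn.Parameter(torch.zeros(num_heads))   # learnable head-wise budget
        self.mem_weight = mem_weight

    def forward(self, keys, values):
        # keys, values: [num_heads, seq_len, head_dim]
        scores = self.scorer(keys).squeeze(-1)                     # [num_heads, seq_len]
        # Soft keep-probability per token; the threshold shifts with the head's budget.
        gate = torch.sigmoid(scores - self.budget_logit.unsqueeze(-1))
        k_kept = keys * gate.unsqueeze(-1)                         # differentiable "soft eviction"
        v_kept = values * gate.unsqueeze(-1)
        mem_penalty = self.mem_weight * gate.mean()                # encourages sparsity
        return k_kept, v_kept, mem_penalty

# Training sketch: total loss = task loss + memory penalty, so compression
# is learned from the task objective rather than a heuristic proxy.
comp = LearnedKVCompressor(num_heads=8, head_dim=64)
k = torch.randn(8, 512, 64); v = torch.randn(8, 512, 64)
k_c, v_c, penalty = comp(k, v)
task_loss = (k_c.sum() - v_c.sum()) ** 2   # stand-in for the model's real task loss
(task_loss + penalty).backward()
```

At inference time a soft gate like this would be thresholded into a hard keep/drop decision per head; the key point is that the scorer and budgets receive gradients from the task loss during training.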

For AI practitioners, this matters because it enables longer-context processing on limited hardware without manual tuning. The learned budgets adapt to different tasks and layers, potentially reducing memory usage and latency, and the approach could be integrated into existing LLM serving systems to improve throughput. Future work may extend this idea to other memory bottlenecks in transformers, making large-scale language models more practical for real-world applications.


Source: arXiv Machine Learning: LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction