turboquant compresses llm caches to 3 bits

source: kdnuggets: turboquant: is the compression and performance worth the hype?

level: technical

turboquant is a new compression library from google for large language models and vector search. it cuts key-value (kv) cache storage to about 3 bits per element and avoids the memory overhead of traditional quantization by using a two-stage process. first, polarquant maps vector coordinates into a polar coordinate system, which removes the need to store extra quantization constants. second, qjl applies a one-bit compression step that corrects the bias introduced by the first stage. together, the two stages achieve high compression without retraining the model and, according to the article, without losing accuracy.
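
to make the polar-mapping idea concrete, here is a toy sketch in python with numpy. it is an illustration under stated assumptions, not turboquant's actual code or api: the function names `polar_quantize` and `polar_dequantize` are hypothetical, coordinates are simply paired up, the radius is kept at full precision, and only the angle is quantized. the point it shows is that an angle always lies in a known range, so the quantization grid can be fixed in advance and no per-block scale or zero-point has to be stored.

```python
import numpy as np


def polar_quantize(x, angle_bits=3):
    """Toy polar quantizer (hypothetical helper, not the library's API).

    Pairs up coordinates, converts each pair to (radius, angle), and
    quantizes the angle on a fixed grid. Assumes len(x) is even.
    """
    pairs = np.asarray(x, dtype=np.float64).reshape(-1, 2)
    radius = np.hypot(pairs[:, 0], pairs[:, 1])
    angle = np.arctan2(pairs[:, 1], pairs[:, 0])  # always in [-pi, pi]
    levels = 2 ** angle_bits
    step = 2 * np.pi / levels
    # The grid depends only on angle_bits, never on the data, so there is
    # no scale or zero-point to store alongside the codes.
    code = np.floor((angle + np.pi) / step).astype(np.int64) % levels
    return radius, code


def polar_dequantize(radius, code, angle_bits=3):
    """Rebuild an approximate vector from radii and angle codes."""
    levels = 2 ** angle_bits
    step = 2 * np.pi / levels
    angle = (code + 0.5) * step - np.pi  # centre of each angular bin
    pairs = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
    return pairs.reshape(-1)


rng = np.random.default_rng(0)
x = rng.standard_normal(64)
radius, code = polar_quantize(x)
x_hat = polar_dequantize(radius, code)
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

in this toy version the vector norm is preserved exactly because the radii are untouched, and the error comes only from rounding the angles. a real scheme would also have to compress the radii, and the bias-correcting second stage is not modeled here.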

in tests with a 1.1-billion-parameter model, turboquant cut cache memory from 42.45 mb to 7.86 mb, a 5.4x reduction. on short sequences the compressed run was slower than the baseline, at 0.61x its speed; the gains appear with longer contexts. on enterprise hardware with contexts over 32,000 tokens, throughput can increase by up to 8x. the library installs via pip and works with hugging face models, and a simple benchmark script compares performance with and without turboquant's 3-bit cache.
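
the reported figures can be checked with a few lines of arithmetic. the snippet below is only a sanity check on the numbers quoted above; the helper name `compression_summary` is made up for illustration and is not part of any library.

```python
def compression_summary(baseline_mb, compressed_mb):
    """Return (reduction factor, percent of memory saved)."""
    ratio = baseline_mb / compressed_mb
    saved_pct = 100.0 * (1.0 - compressed_mb / baseline_mb)
    return ratio, saved_pct


ratio, saved_pct = compression_summary(42.45, 7.86)
print(f"reduction: {ratio:.1f}x, memory saved: {saved_pct:.1f}%")
# prints: reduction: 5.4x, memory saved: 81.5%

# A "speedup" of 0.61x means the run takes longer than the baseline.
relative_runtime = 1.0 / 0.61
print(f"runtime vs baseline: {relative_runtime:.2f}x")
# prints: runtime vs baseline: 1.64x
```

so a 0.61x figure on short sequences means the compressed run takes roughly 1.64 times as long as the uncompressed one.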

the compression targets the kv cache, which grows linearly with context length. traditional quantization methods often store full-precision constants for each block of data, which adds memory overhead. turboquant's polar mapping simplifies the data geometry so those constants are not needed, and the qjl step corrects the residual bias left by the first stage. this makes it practical for retrieval-augmented generation systems, where large caches slow down inference. developers can test it on free gpu tiers such as google colab's t4 to see the memory savings firsthand.
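
a rough sketch of both effects, linear cache growth and per-block constant overhead, is given below. the model shape (22 layers, 4 kv heads, head dimension 64) and the block layout (32 elements sharing a 16-bit scale and a 16-bit zero-point) are assumptions chosen for illustration; they do not come from the article. the function names are hypothetical as well.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_elem):
    """Approximate kv cache size in bytes for one sequence.

    The factor 2 accounts for storing both keys and values.
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elems * bits_per_elem / 8


def effective_bits(payload_bits, block_size, const_bits_per_block):
    """Bits per element once per-block constants are amortized."""
    return payload_bits + const_bits_per_block / block_size


# Assumed (hypothetical) model shape, not taken from the article.
shape = dict(n_layers=22, n_kv_heads=4, head_dim=64)

for seq_len in (1_000, 8_000, 32_000):
    fp16 = kv_cache_bytes(seq_len, bits_per_elem=16, **shape)
    three_bit = kv_cache_bytes(seq_len, bits_per_elem=3, **shape)
    print(f"{seq_len:>6} tokens: fp16 {fp16 / 2**20:8.1f} MiB, "
          f"3-bit {three_bit / 2**20:8.1f} MiB")

# Assumed block layout: 32 elements share a 16-bit scale and a
# 16-bit zero-point, so nominal 3-bit storage really costs 4 bits.
print(effective_bits(payload_bits=3, block_size=32, const_bits_per_block=32))
```

doubling the context doubles the cache at any precision, and under the assumed block layout the stored constants add a full extra bit per element. that extra bit is the kind of overhead a scheme without per-block constants avoids.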

why it matters: it enables running large language models with much less memory, making real-time ai applications cheaper and faster.
