source: arxiv machine learning: latent cache flow: model-to-model communication without text

level: research

large language models usually talk to each other by generating and reading text. this is slow because the sender must decode tokens one at a time and the receiver must then re-encode them. it also discards information held in the sender's internal state. prior work, cache-to-cache, addressed this by sending key-value caches directly, but it needed large, costly adapters and required both models to share the same context, which is rare in agent setups.

latent cache flow addresses both problems. first, it compresses and translates keys and values jointly, shrinking the adapter to about 4% of the cache-to-cache size. a 13 megabyte adapter is enough for early tests. second, the adapter is designed to transmit only a summary of the new information the receiver lacks, so the two models can hold different contexts. this makes it practical for real agent communication, where each model sees different data.
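
the two ideas above can be illustrated with a toy numpy sketch. this is not the paper's architecture: the shared bottleneck, the attention-style pooling into a fixed number of summary slots, and every name and dimension here (`make_adapter`, `translate_cache`, `n_slots`, the random weights) are assumptions made for illustration only. it shows one way keys and values could pass through a single joint projection, be reduced to a short summary, and be appended to a receiver cache that holds a different context.

```python
import numpy as np


def make_adapter(d_sender, d_receiver, bottleneck, n_slots, seed=0):
    """Hypothetical joint key-value adapter with random (untrained) weights."""
    rng = np.random.default_rng(seed)
    scale = 0.02
    return {
        # One shared down-projection for concatenated [K, V] features.
        "down": rng.normal(0.0, scale, (2 * d_sender, bottleneck)),
        # Separate up-projections into the receiver's key and value spaces.
        "up_k": rng.normal(0.0, scale, (bottleneck, d_receiver)),
        "up_v": rng.normal(0.0, scale, (bottleneck, d_receiver)),
        # Summary-slot queries used to pool the sender sequence.
        "queries": rng.normal(0.0, scale, (n_slots, bottleneck)),
    }


def translate_cache(adapter, k_send, v_send):
    """Map a sender cache of shape (T, d_sender) to (n_slots, d_receiver) K and V."""
    kv = np.concatenate([k_send, v_send], axis=-1)   # (T, 2 * d_sender)
    z = kv @ adapter["down"]                         # (T, bottleneck)

    # Softmax attention pooling: each slot is a weighted mix of sender positions.
    scores = adapter["queries"] @ z.T / np.sqrt(z.shape[-1])   # (n_slots, T)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    summary = weights @ z                            # (n_slots, bottleneck)

    return summary @ adapter["up_k"], summary @ adapter["up_v"]


# Toy dimensions, chosen arbitrarily for the example.
adapter = make_adapter(d_sender=64, d_receiver=48, bottleneck=16, n_slots=4)

rng = np.random.default_rng(1)
k_send = rng.normal(size=(10, 64))   # sender cache: 10 positions
v_send = rng.normal(size=(10, 64))
k_recv = rng.normal(size=(7, 48))    # receiver cache: 7 positions, different context
v_recv = rng.normal(size=(7, 48))

k_new, v_new = translate_cache(adapter, k_send, v_send)

# The receiver keeps its own cache and appends only the translated summary slots.
k_ext = np.concatenate([k_recv, k_new], axis=0)
v_ext = np.concatenate([v_recv, v_new], axis=0)

# Parameter count of the joint bottleneck versus two full-rank K and V maps.
joint_params = sum(w.size for w in adapter.values())
separate_params = 2 * 64 * 48
print(k_ext.shape, joint_params, separate_params)
```

in this toy setup the shared bottleneck uses fewer parameters than two independent full-rank maps, and the receiver's cache grows by only `n_slots` entries regardless of how long the sender's context is. the numbers carry no relation to the 4% or 13 megabyte figures reported above.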

the approach reduces latency and preserves more meaning than text-based exchanges. by avoiding autoregressive decoding, it speeds up model-to-model interactions. the compressed adapter is cheaper to train and deploy. handling differing contexts means agents can specialize and still share relevant updates efficiently. early experiments show promise for multi-agent systems where fast, rich communication is key.

why it matters: faster, richer communication between ai agents can improve multi-step reasoning and collaboration without the bottleneck of text.

