gaussian mixture attention cuts transformer cost

source: arxiv machine learning: gaussian mixture attention: linear-time sequence mixing via probabilistic latent routing

level: research

standard dot-product attention in transformers compares every token with every other token, so its compute and memory grow quadratically with the sequence length n, which becomes a major bottleneck for long sequences. gaussian mixture attention replaces this dense pairwise interaction with a routing mechanism through a small set of learned gaussian mixture components. queries and keys are each mapped to responsibility vectors over these components, and their overlap defines an implicit affinity without ever computing the full n-by-n attention matrix.
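
as a rough illustration of what a responsibility vector is, the numpy sketch below softly assigns each query and key to k components. it is not the paper's formulation: the isotropic components, shared variance sigma, equal mixing weights, and the names responsibilities and means are all assumptions made for the example, and the component means are random rather than learned.

```python
import numpy as np

def responsibilities(x, means, sigma=1.0):
    """soft assignment of each row of x to k isotropic gaussian components.

    assumption: equal mixing weights and a shared variance, so the posterior
    reduces to a softmax over negative scaled squared distances.
    """
    # squared distance from every token to every component mean: shape (n, k)
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the softmax
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, k = 6, 4, 3                      # tokens, feature size, components
queries = rng.normal(size=(n, d))
keys = rng.normal(size=(n, d))
means = rng.normal(size=(k, d))        # stand-in for learned component means

r_q = responsibilities(queries, means)  # (n, k), each row sums to 1
r_k = responsibilities(keys, means)     # (n, k), each row sums to 1

# the implicit affinity is the overlap of the two responsibility vectors.
# it is built here only to show what the routing stands in for; the point
# of the method is that this n-by-n matrix never has to be formed.
affinity = r_q @ r_k.T
```

two tokens get a high affinity only when they place their mass on the same components, which is how the components mediate the interaction.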

the method uses a latent memory with k slots, where k is the number of mixture components. values are written into and read from this memory using the responsibility matrices. because matrix multiplication is associative, the computation can be reordered to avoid materializing the large affinity matrix. the dominant storage scales with n times k instead of n squared, making it feasible for very long contexts when k is fixed and much smaller than n.
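
the reordering can be checked numerically. the sketch below is a minimal example under assumed shapes, with random row-normalised matrices standing in for real responsibilities and any normalisation the paper may apply left out. it compares the dense route, which builds the n-by-n affinity, with the routed one, which writes values into a k-slot memory and reads them back.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 512, 8, 16   # tokens, mixture components, value width (assumed sizes)

def random_responsibilities(rows, cols):
    # stand-in for real responsibilities: non-negative rows that sum to 1
    raw = rng.random((rows, cols))
    return raw / raw.sum(axis=1, keepdims=True)

r_q = random_responsibilities(n, k)   # query responsibilities, (n, k)
r_k = random_responsibilities(n, k)   # key responsibilities, (n, k)
values = rng.normal(size=(n, d))      # value vectors, (n, d)

# dense order: materialises an (n, n) affinity matrix first
dense_out = (r_q @ r_k.T) @ values

# routed order: write values into the k-slot memory, then read from it
memory = r_k.T @ values               # write step, shape (k, d)
routed_out = r_q @ memory             # read step, shape (n, d)

# associativity guarantees both orders give the same result
same = np.allclose(dense_out, routed_out)
```

in the routed order the largest arrays are the n-by-k responsibility matrices and the n-by-d values, and the work is on the order of n times k times d rather than n squared times d.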

the paper presents both bidirectional and causal variants of the approach. by framing attention as probabilistic routing, the model learns to softly assign tokens to mixture components, which then mediate information exchange. this keeps the mixing operation linear in sequence length while still allowing tokens to interact indirectly through the shared latent space. the design aims to preserve the benefits of attention-style mixing without the quadratic complexity.
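
this summary does not say how the causal variant is built. one common way to make this kind of routing causal, shown here purely as an assumption, is to keep a running memory that only contains tokens seen so far. the function name causal_routed_mixing and all shapes are hypothetical.

```python
import numpy as np

def causal_routed_mixing(r_q, r_k, values):
    """read each token from a memory holding only tokens 0..t.

    a sketch of a prefix-sum style causal variant, not the paper's code.
    """
    n, k = r_k.shape
    d = values.shape[1]
    memory = np.zeros((k, d))
    out = np.empty((n, d))
    for t in range(n):
        # write token t into the slots it is responsible for
        memory += np.outer(r_k[t], values[t])
        # read with the query responsibilities of the same token
        out[t] = r_q[t] @ memory
    return out

rng = np.random.default_rng(0)
n, k, d = 32, 4, 8

def random_responsibilities(rows, cols):
    # stand-in for real responsibilities: non-negative rows that sum to 1
    raw = rng.random((rows, cols))
    return raw / raw.sum(axis=1, keepdims=True)

r_q = random_responsibilities(n, k)
r_k = random_responsibilities(n, k)
values = rng.normal(size=(n, d))

causal_out = causal_routed_mixing(r_q, r_k, values)

# reference: dense affinity with a lower-triangular (causal) mask
masked_out = np.tril(r_q @ r_k.T) @ values
matches = np.allclose(causal_out, masked_out)
```

the loop touches each token once and carries only a k-by-d memory between steps, so the causal form stays linear in sequence length as well.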

why it matters: this approach could make transformers practical for much longer sequences, reducing memory and compute costs in tasks like long-document processing or high-resolution data analysis.
