level: research
This paper examines how attention pooling influences signal recovery in sequence models. The authors consider token embeddings drawn from a two-class Gaussian mixture and pooled with fixed attention weights, working in a high-dimensional limit where the embedding dimension, vocabulary size, and sequence length grow proportionally. The goal is to understand when the pooled representation retains enough information to distinguish the underlying class signal.
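To make the setup concrete, here is a minimal numpy sketch of one way to instantiate such a model. The concrete dimensions, the parameter name beta for the signal strength, and the choice of softmax weights over a frozen random score vector are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Proportional regime: embedding dim, vocab size, seq length, sample count
# are all comparable (assumed concrete values).
d, V, L, n = 300, 600, 150, 900

# Random vocabulary embedding matrix; column j is the embedding of token j.
E = rng.normal(size=(d, V))

# Unit-norm class signal direction.
mu = rng.normal(size=d)
mu /= np.linalg.norm(mu)

# Fixed attention weights: softmax of a frozen random score vector.
scores = rng.normal(size=L)
w = np.exp(scores)
w /= w.sum()

def pooled_sequence(y, w, beta):
    """Attention-pool one sequence of token embeddings with class label y = +/-1."""
    tokens = rng.integers(V, size=L)            # i.i.d. tokens from a finite vocabulary
    X = E[:, tokens] + y * beta * mu[:, None]   # class-dependent mean shift per token
    return X @ w                                # pooling with fixed attention weights

labels = rng.choice([-1.0, 1.0], size=n)
P = np.stack([pooled_sequence(y, w, beta=0.4) for y in labels], axis=1)  # (d, n)
```

Since the weights sum to one, each pooled vector equals beta * y * mu plus a weighted combination of vocabulary embeddings: the class signal survives pooling intact, while the noise level depends on the weights and on token repetition.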
The analysis derives exact formulas for the limiting eigenvalue distribution of the sample covariance matrix of the pooled representations. The bulk spectrum follows a non-standard law given by the free multiplicative convolution of two Marchenko–Pastur distributions, a consequence of the finite vocabulary: randomness enters both through the vocabulary embedding matrix and through which tokens appear in each sequence, and the two sources compose multiplicatively. The authors identify two phase transitions that govern signal recovery, with thresholds depending on the attention weights, the positional correlation structure, and the dimensional ratios. Above a critical signal strength, an outlier eigenvalue detaches from the bulk, indicating detectable class information.
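Reusing the sketch above, one can watch the outlier emerge empirically: sweep the assumed signal strength and compare the top eigenvalue of the sample covariance of pooled vectors against the next-largest one. A ratio well clear of 1 indicates an eigenvalue detached from the bulk; the threshold location in this toy parameterization is not the paper's formula.

```python
# Sweep the signal strength; the top eigenvalue detaches from the bulk
# once the empirical detection threshold is crossed.
for beta in [0.05, 0.1, 0.2, 0.4, 0.8]:
    P = np.stack([pooled_sequence(y, w, beta) for y in labels], axis=1)
    evals = np.linalg.eigvalsh(P @ P.T / n)   # ascending spectrum of sample covariance
    print(f"beta={beta:.2f}  top={evals[-1]:.4f}  "
          f"next={evals[-2]:.4f}  ratio={evals[-1] / evals[-2]:.2f}")
```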
The results show that the attention weights act as a tunable parameter for signal detection: adjusting them shifts the phase transition boundary. The paper also gives precise conditions under which the leading eigenvector aligns with the true signal direction, the alignment that matters for downstream tasks such as classification. The findings offer a theoretical lens on how attention mechanisms in transformers might filter noise and amplify relevant features in high-dimensional data.
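The tunable role of the weights can be probed the same way. The sketch below sweeps an assumed softmax temperature tau (sharp, concentrated weights for small tau; near-uniform weights for large tau), which changes the effective noise level of the pooled representation and thereby moves the transition; the squared overlap between the leading eigenvector and mu shows when alignment sets in.

```python
# Vary attention sharpness at fixed signal strength and measure how well
# the leading eigenvector aligns with the true signal direction.
beta = 0.4
for tau in [0.25, 1.0, 4.0]:
    w_tau = np.exp(scores / tau)
    w_tau /= w_tau.sum()                       # small tau -> weights concentrate
    P = np.stack([pooled_sequence(y, w_tau, beta) for y in labels], axis=1)
    evals, evecs = np.linalg.eigh(P @ P.T / n)
    overlap = float(evecs[:, -1] @ mu) ** 2    # squared alignment with the signal
    print(f"tau={tau:4.2f}  top={evals[-1]:.4f}  overlap={overlap:.2f}")
```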
Why it matters: the paper gives precise conditions for when attention pooling preserves class information, guiding architecture design for sequence models.