level: research
transformers can learn new tasks from examples inside a prompt, a skill called in-context learning. researchers often study how the diversity of training tasks affects this ability. diversity is usually defined in one of two ways: the number of training task vectors, or the number of function classes those vectors come from. the second definition is backed by many empirical findings but has lacked a clear theoretical explanation.
this paper introduces a simple analytical model in which training task vectors are drawn from a mixture of low-rank gaussians. each gaussian's covariance matrix is supported on a low-dimensional subspace. task diversity is measured by how many of the columns spanning these subspaces do not overlap. the analysis proves that more non-overlapping columns lead to better in-context learning. the reason is that diverse subspaces reduce interference between tasks during training, allowing the transformer to form clearer internal representations.
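to make the setup concrete, here is a minimal sketch of sampling task vectors from a mixture of low-rank gaussians and counting non-overlapping columns. it rests on simplifying assumptions that are not taken from the paper: each subspace is spanned by a subset of standard basis directions (given as coordinate indices), every spanned direction has unit variance, and a column counts as non-overlapping when exactly one mixture component uses it. the names `sample_task_vector` and `count_non_overlapping_columns` are hypothetical.

```python
import random


def sample_task_vector(subspaces, weights, dim, rng):
    """Draw one task vector from a mixture of low-rank Gaussians.

    Assumption: component k is described by a list of coordinate indices,
    a simplified stand-in for the columns of a basis matrix U_k, so its
    covariance U_k U_k^T has rank len(subspaces[k]).
    """
    # Pick a mixture component according to the mixture weights.
    k = rng.choices(range(len(subspaces)), weights=weights)[0]
    w = [0.0] * dim
    for j in subspaces[k]:
        # Unit variance along each spanned direction, zero elsewhere.
        w[j] = rng.gauss(0.0, 1.0)
    return k, w


def count_non_overlapping_columns(subspaces):
    """Count columns (directions) used by exactly one component.

    Assumption: this is one possible reading of "non-overlapping columns".
    """
    counts = {}
    for cols in subspaces:
        for j in set(cols):
            counts[j] = counts.get(j, 0) + 1
    return sum(1 for c in counts.values() if c == 1)


rng = random.Random(0)
dim = 8
overlapping = [[0, 1, 2], [1, 2, 3]]  # components share directions 1 and 2
disjoint = [[0, 1, 2], [3, 4, 5]]     # components share no direction

k, w = sample_task_vector(disjoint, [0.5, 0.5], dim, rng)
print("component:", k, "task vector:", w)
print("overlapping:", count_non_overlapping_columns(overlapping))
print("disjoint:", count_non_overlapping_columns(disjoint))
```

under this reading, the disjoint configuration has more non-overlapping columns than the overlapping one, so it counts as the more diverse training distribution.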
the analysis shows that when subspaces share many directions, the transformer struggles to separate tasks. with more distinct subspaces, it can learn a more structured weight matrix that generalizes well to new tasks from the same distribution. the findings match earlier empirical observations and provide a principled way to think about data design for training language models. the work focuses on linear transformers and simple regression tasks, but the insights may extend to more complex settings.
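for readers unfamiliar with the regression setting, the sketch below builds one in-context prompt for a given task vector. the prompt format is an assumption based on the common in-context linear regression setup, not a detail confirmed by the paper: inputs are standard gaussian, labels are noiseless inner products with the task vector, and the last input is a query whose label the model must predict. `make_prompt` is a hypothetical helper.

```python
import random


def make_prompt(w, n_examples, rng):
    """Build one in-context regression prompt for task vector w.

    Returns (examples, query, target), where examples is a list of (x, y)
    pairs with y = <w, x>, and target is the held-out label of the query.
    """
    dim = len(w)

    def label(x):
        # Noiseless linear label: inner product of task vector and input.
        return sum(wi * xi for wi, xi in zip(w, x))

    examples = []
    for _ in range(n_examples):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        examples.append((x, label(x)))

    query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return examples, query, label(query)


rng = random.Random(0)
w = [1.0, -2.0, 0.0, 0.0]  # a task vector lying in a 2-dimensional subspace
examples, query, target = make_prompt(w, n_examples=5, rng=rng)
print("in-context examples:", len(examples))
print("query:", query, "target:", target)
```

the transformer sees the example pairs and the query in its context and is scored on how close its prediction is to the target, so a task vector drawn from a low-rank component only ever exercises the directions in that component's subspace.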
why it matters: understanding how task diversity improves in-context learning can guide the creation of better training data for large language models, making them more adaptable to new tasks without fine-tuning.