source: arxiv machine learning: gem: geometric entropy mixing for optimal llm data curation

level: research

large language model pretraining now depends more on data composition than on total data size. current methods for mixing data have two problems. human-made categories often do not match the model's internal structure. standard clustering in euclidean space fails because word embeddings are anisotropic: they concentrate in a narrow region of the space instead of spreading evenly across directions. both issues lead to poor data mixtures that can hurt model performance.
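to make the anisotropy problem concrete, here is a minimal sketch that measures it as the mean pairwise cosine similarity of a set of embeddings. this is an illustration, not part of gem: the function name `mean_cosine_similarity`, the synthetic data, and the shared offset used to simulate anisotropy are all assumptions made for the example.

```python
import numpy as np


def mean_cosine_similarity(embeddings):
    """Average cosine similarity over all distinct pairs of rows.

    Values near 0 suggest directions are spread evenly (isotropic);
    values near 1 suggest the vectors crowd into a narrow cone.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(unit)
    # drop the diagonal (self-similarity is always 1)
    off_diagonal_sum = sims.sum() - np.trace(sims)
    return off_diagonal_sum / (n * (n - 1))


rng = np.random.default_rng(0)

# synthetic stand-ins for real embeddings (assumption for illustration)
isotropic = rng.normal(size=(500, 64))
# adding one large shared offset pushes every vector toward the same direction
anisotropic = isotropic + 5.0 * np.ones(64)

iso_score = mean_cosine_similarity(isotropic)
aniso_score = mean_cosine_similarity(anisotropic)
print(f"isotropic: {iso_score:.3f}  anisotropic: {aniso_score:.3f}")
```

when the score is high, euclidean distances between points are dominated by the shared direction, which is one reason to compare directions on the unit sphere instead.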

the gem framework treats data curation as a clustering problem on a hypersphere. it adds a mixing-balance regularizer to prevent the clusters from collapsing into one dominant group. a minorize-maximize algorithm solves this problem, finding balanced semantic structures that euclidean methods miss. to handle web-scale data, gem uses teacher-student distillation, where a smaller model learns to reproduce the geometric structure captured by a larger one. the framework also produces a geometric influence score, which is used to create interpretable taxonomies.
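the idea of balanced clustering on a hypersphere can be sketched as follows. this is not gem's objective or algorithm, which the summary does not specify. it is a stand-in under stated assumptions: a mixture of directional clusters with a fixed concentration `kappa`, fit with expectation-maximization (a special case of minorize-maximize), where a symmetric dirichlet pseudo-count on the mixing weights plays the role of a balance regularizer. the names `balanced_spherical_em`, `kappa`, and `pseudo_count` are hypothetical.

```python
import numpy as np


def normalize_rows(x):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def balanced_spherical_em(x, k, kappa=10.0, pseudo_count=5.0, iters=50, seed=0):
    """Soft clustering of directions with a pull toward balanced mixing weights.

    Assumption: this is an illustrative substitute, not the method from the paper.
    """
    rng = np.random.default_rng(seed)
    x = normalize_rows(x)
    n = len(x)
    centers = x[rng.choice(n, size=k, replace=False)]
    weights = np.full(k, 1.0 / k)

    for _ in range(iters):
        # e-step: responsibilities from cosine similarity and mixing weights
        logits = kappa * (x @ centers.T) + np.log(weights)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)

        # m-step: centers are renormalized weighted means of the points
        centers = normalize_rows(resp.T @ x)

        # m-step: the pseudo-count keeps every weight away from zero,
        # which discourages collapse into one dominant cluster
        counts = resp.sum(axis=0)
        weights = (counts + pseudo_count) / (n + k * pseudo_count)

    return centers, weights, resp.argmax(axis=1)


# synthetic data: three noisy groups around the coordinate axes (illustration only)
rng = np.random.default_rng(1)
points = np.vstack([axis + 0.1 * rng.normal(size=(100, 3)) for axis in np.eye(3)])

centers, weights, labels = balanced_spherical_em(points, k=3)
print("mixing weights:", np.round(weights, 3))
```

a larger `pseudo_count` pulls the mixing weights harder toward uniform, and setting it to 0 recovers the unregularized update. teacher-student distillation and the geometric influence score are not covered by this sketch.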

experiments with 1.1-billion-parameter models show that gem sets a new standard for data mixing. the approach discovers data groupings that are both diverse and balanced, leading to better training outcomes. by focusing on the geometry of embeddings, gem avoids the biases of human labeling and the failures of simple euclidean clustering. this makes it a practical tool for selecting and combining training data for large models.

why it matters: better data mixing directly improves llm training efficiency and downstream performance without needing more data.

