Large language models are often monolithic, but many tasks need only a fraction of their capabilities. Mixture-of-experts (MoE) models split the network into smaller expert subnetworks and activate only a few per token, yet existing MoEs still need all experts available for good performance, because experts specialize in low-level patterns such as punctuation rather than meaningful domains. EMO addresses this by encouraging experts to form coherent, domain-specific groups during pretraining, without relying on predefined categories.
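For context, a minimal sketch of the token-level top-k routing used by standard MoEs; tensor names and sizes are illustrative and not taken from the blog post:

    import torch
    import torch.nn.functional as F

    def topk_token_routing(hidden, router_weight, k=2):
        # hidden: (tokens, d_model); router_weight: (num_experts, d_model).
        # Standard MoE routing: each token independently picks its own top-k
        # experts, so expert usage is decided token by token with no
        # document-level constraint.
        logits = hidden @ router_weight.t()                  # (tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        weights, expert_ids = torch.topk(probs, k, dim=-1)   # per-token choices
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return expert_ids, weights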
EMO uses document boundaries as a weak supervision signal: all tokens in a document share a pool of experts chosen by the router. This forces consistent expert usage within a document and leads to emergent specialization in areas such as health, politics, or music. Global load balancing ensures that all experts are used across documents, and varying the pool size during training lets EMO support different expert-subset sizes at inference. A 1B-active, 14B-total-parameter EMO trained on 1 trillion tokens matches a standard MoE on general benchmarks.
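A minimal sketch of the document-level pooling described above, assuming the shared pool is picked from router scores aggregated over the document; the function name, pooling rule, and sizes are illustrative, not EMO's exact implementation:

    import torch
    import torch.nn.functional as F

    def route_within_document_pool(hidden, router_weight, pool_size, k=2):
        # hidden: (tokens, d_model) for ONE document;
        # router_weight: (num_experts, d_model); assumes k <= pool_size.
        # 1) Pick a shared pool of experts for the whole document
        #    (here: from router scores averaged over the document's tokens).
        logits = hidden @ router_weight.t()               # (tokens, num_experts)
        doc_scores = logits.mean(dim=0)                   # document-level affinity
        pool = torch.topk(doc_scores, pool_size).indices  # shared expert pool
        # 2) Mask experts outside the pool, then do per-token top-k routing,
        #    so every token in the document draws from the same pool.
        mask = torch.full_like(logits, float('-inf'))
        mask[:, pool] = 0.0
        probs = F.softmax(logits + mask, dim=-1)
        weights, expert_ids = torch.topk(probs, k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return pool, expert_ids, weights

Varying pool_size across documents during training is what, per the summary, lets the model tolerate different expert-subset sizes at inference; a separate global load-balancing objective keeps all experts in use across documents.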
When using only 12.5% of the experts, EMO retains near full-model performance, while a standard MoE degrades sharply. Selecting which experts to keep is cheap: a single few-shot example works about as well as a full validation set. This matters for AI and data science because it enables flexible deployment of large models with lower memory and compute costs, and the emergent modularity could improve interpretability and model updating. The model, the baseline, and the code are released for further study.
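A sketch of how an expert subset might be chosen from router statistics collected on a single few-shot example; the function and the usage-based ranking rule are illustrative assumptions, not the released code:

    import torch

    def select_expert_subset(router_probs, keep_fraction=0.125):
        # router_probs: (tokens, num_experts) router probabilities gathered while
        # running one few-shot example through the model (a stand-in for the
        # selection procedure described in the post).
        num_experts = router_probs.shape[-1]
        keep = max(1, int(round(keep_fraction * num_experts)))
        usage = router_probs.sum(dim=0)                 # how much each expert was used
        kept_experts = torch.topk(usage, keep).indices  # experts to load at inference
        return kept_experts

For example, with 64 experts and keep_fraction=0.125, only 8 experts would be kept in memory, and the router would be restricted (masked) to that subset at inference time.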
Source: Hugging Face Blog: EMO: Pretraining mixture of experts for emergent modularity