source: pytorch blog: using muon optimizer with deepspeed

level: technical

the muon optimizer is designed for the 2d hidden weights of neural networks. it computes momentum from gradients and applies newton-schulz iterations to approximately orthogonalize the momentum matrix. this orthogonalization equalizes the singular values of the update, which amplifies rare but important update directions. muon keeps only one momentum buffer per parameter, unlike adam's two, saving optimizer state memory. in benchmarks, muon improved training speed by 35% over adamw in nanogpt and reached gpt-2 xl performance 25% faster at 1.5b parameters.
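
the momentum-then-orthogonalize update described above can be sketched in a few lines. this is a minimal, dependency-free illustration, not the deepspeed or reference muon code: it uses the classical cubic newton-schulz iteration rather than a tuned polynomial, omits nesterov momentum and any update scaling, and the function names, learning rate, and momentum coefficient are all assumptions chosen for the example.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(m, steps=20, eps=1e-12):
    """Approximate the nearest (semi-)orthogonal matrix to m.

    Uses the classical cubic iteration X <- 1.5 X - 0.5 X X^T X,
    which pushes every nonzero singular value of X toward 1.
    """
    # Scale by the Frobenius norm so all singular values are at most 1,
    # which keeps the iteration inside its convergence region.
    norm = math.sqrt(sum(v * v for row in m for v in row)) + eps
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rc)] for rx, rc in zip(x, xxtx)]
    return x


def muon_like_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon-style step for a single 2D weight (assumed hyperparameters)."""
    # Single momentum buffer: the only optimizer state kept per parameter.
    momentum = [[beta * mv + gv for mv, gv in zip(rm, rg)] for rm, rg in zip(momentum, grad)]
    # Orthogonalize the momentum before applying it as the update.
    update = newton_schulz_orthogonalize(momentum)
    weight = [[w - lr * u for w, u in zip(rw, ru)] for rw, ru in zip(weight, update)]
    return weight, momentum
```

in practice the same steps would be a handful of tensor operations on the gpu; plain lists are used here only so the sketch runs anywhere.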

deepspeed integrates muon by applying updates in the get_flat_partition function of zero stages 1 and 2, where gradients are still unflattened. parameters are tagged for muon if they are 2d and in hidden layers; others fall back to adamw. the hybrid approach uses separate learning rates for muon and adam parameters. fine-tuning experiments on moonlight-16b-a3b showed muon outperforming adamw on mbpp+, mmlu, and gsm8k, with better generalization on rigorous tests.
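
the tagging rule above (2d and in a hidden layer goes to muon, everything else falls back to adamw, each group with its own learning rate) can be illustrated with a small sketch. it does not use deepspeed's api: `ParamInfo`, `use_muon`, `build_param_groups`, the name fragments used to detect non-hidden layers, and the learning rates are all hypothetical choices for the example.

```python
from dataclasses import dataclass


@dataclass
class ParamInfo:
    name: str
    shape: tuple


# Hypothetical name fragments for layers treated as non-hidden in this sketch.
NON_HIDDEN_MARKERS = ("embed", "lm_head")


def use_muon(p):
    """A parameter is tagged for Muon if it is 2D and belongs to a hidden layer."""
    is_2d = len(p.shape) == 2
    is_hidden = not any(marker in p.name for marker in NON_HIDDEN_MARKERS)
    return is_2d and is_hidden


def build_param_groups(params, muon_lr=0.02, adam_lr=3e-4):
    """Split parameters into a Muon group and an AdamW fallback group."""
    muon_names = [p.name for p in params if use_muon(p)]
    adam_names = [p.name for p in params if not use_muon(p)]
    return [
        {"optimizer": "muon", "lr": muon_lr, "params": muon_names},
        {"optimizer": "adamw", "lr": adam_lr, "params": adam_names},
    ]


params = [
    ParamInfo("embed_tokens.weight", (32000, 1024)),   # 2D but not hidden
    ParamInfo("layers.0.attn.q_proj.weight", (1024, 1024)),
    ParamInfo("layers.0.mlp.up_proj.weight", (4096, 1024)),
    ParamInfo("layers.0.norm.weight", (1024,)),        # 1D
    ParamInfo("lm_head.weight", (32000, 1024)),        # 2D but not hidden
]
groups = build_param_groups(params)
```

here only the two projection matrices land in the muon group; the embedding, norm, and output head stay with adamw.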

memory savings come from muon's single momentum buffer. for typical transformers, about 90% of parameters are 2d hidden weights; halving the optimizer state for those parameters reduces total optimizer state memory by roughly 45%. measured on qwen2.5-3b fine-tuning, muon cut peak gpu memory by 9% (3 gib) compared to adamw. this can help fit workloads on-device without cpu offloading. future deepspeed plans include zero stage 3 support, faster orthogonalization kernels, cpu offloading, and a muonclip variant.
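
the 45% figure follows from simple buffer counting, sketched below. the function name is hypothetical, and the calculation assumes every buffer has the same size per parameter and ignores any other optimizer state.

```python
def optimizer_state_reduction(muon_fraction, adam_buffers=2, muon_buffers=1):
    """Fractional reduction in optimizer state versus using AdamW everywhere.

    muon_fraction: share of parameters handled by Muon (the rest use AdamW).
    """
    # Average number of state buffers per parameter in the hybrid setup.
    hybrid = muon_fraction * muon_buffers + (1 - muon_fraction) * adam_buffers
    return 1 - hybrid / adam_buffers


# 90% of parameters on Muon: (0.9 * 1 + 0.1 * 2) / 2 = 0.55 of the AdamW state.
reduction = optimizer_state_reduction(0.9)
```

the measured drop in peak gpu memory (9%) is smaller than 45% because peak memory also covers weights, gradients, and activations, not just optimizer state.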

why it matters: the muon optimizer reduces optimizer state memory and speeds up convergence, making large model training more efficient and accessible on limited hardware.

