signmuon: 1-bit matrix-aware optimizer for distributed training

source: arxiv machine learning: signmuon: communication-efficient distributed muon optimization

level: research

distributed training of large neural networks often slows down due to full-precision gradient communication and optimizers that ignore weight matrix structure. signmuon addresses both issues by merging majority-vote sign aggregation from signsgd with the polar-step framework of muon. each worker computes a muon-style direction by taking the polar factor of its momentum using a newton-schulz iteration, then transmits only the entrywise signs. the server aggregates these signs by majority vote. an optional local polar step can enforce orthogonality without extra communication cost.
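
the round described above can be sketched in a few lines of numpy. this is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`newton_schulz_polar`, `worker_message`, `majority_vote`) are hypothetical, the iteration shown is the classical cubic newton-schulz scheme rather than whatever coefficients the authors use, and step sizes, momentum updates, and the optional local polar step are left out.

```python
import numpy as np

def newton_schulz_polar(m, steps=30):
    """approximate the polar factor u @ v.T of m (classical cubic iteration, an assumption)."""
    # frobenius normalization keeps every singular value at most 1,
    # which lies inside the convergence region of the cubic iteration
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        # pushes each singular value of x toward 1, leaving singular vectors unchanged
        x = 1.5 * x - 0.5 * x @ (x.T @ x)
    return x

def to_signs(a):
    """map a real matrix to entries in {-1, +1}; zeros are sent to +1."""
    return np.where(a >= 0, 1, -1).astype(np.int8)

def worker_message(momentum):
    """worker side: polar factor of the local momentum, then entrywise signs."""
    return to_signs(newton_schulz_polar(momentum))

def majority_vote(messages):
    """server side: entrywise majority over the workers' sign matrices."""
    total = np.stack(messages).astype(np.int32).sum(axis=0)
    # with an odd number of workers there are no ties; ties would map to +1 here
    return to_signs(total)

# toy round: 5 workers see noisy copies of the same 4x3 momentum matrix
rng = np.random.default_rng(0)
shared_momentum = rng.standard_normal((4, 3))
messages = [
    worker_message(shared_momentum + 0.1 * rng.standard_normal((4, 3)))
    for _ in range(5)
]
update_signs = majority_vote(messages)
print(update_signs)
```

each worker sends only `update_signs`-shaped int8 sign matrices in this sketch; a real system would pack them to 1 bit per entry before transmission.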

the method comes with convergence guarantees. under spectral-norm smoothness and bounded-variance stochastic gradients, the spectral-norm normalized sign step achieves a nonconvex convergence rate of order 1/sqrt(t) in the number of iterations t for an l1-based stationarity measure. with unimodal symmetric noise, majority vote across m workers reduces the stochastic term by a factor of 1/sqrt(m), matching signsgd's scaling. signmuon therefore keeps the worker scaling and communication efficiency of 1-bit methods while adding awareness of matrix structure.
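
a small simulation can make the effect of majority voting concrete. the sketch below is an illustration, not a reproduction of the paper's analysis: it assumes a single scalar coordinate with a positive true gradient, gaussian noise (one example of unimodal symmetric noise), and hypothetical parameter values, and it measures how often the voted sign is wrong rather than the stochastic term of the bound itself.

```python
import numpy as np

def vote_error_rate(num_workers, signal=0.5, noise_std=1.0, trials=20000, seed=0):
    """fraction of trials in which the majority-voted sign disagrees with the true sign.

    the true gradient is `signal` > 0; each worker observes it plus gaussian noise.
    use an odd num_workers so that the vote never ties.
    """
    rng = np.random.default_rng(seed)
    grads = signal + noise_std * rng.standard_normal((trials, num_workers))
    # each worker contributes +1 or -1; the row sum is the vote margin
    margin = np.where(grads >= 0, 1, -1).sum(axis=1)
    # the true sign is positive, so a negative margin is a wrong vote
    return float(np.mean(margin < 0))

for m in (1, 5, 25):
    print(m, vote_error_rate(m))
```

under these assumptions the error rate falls as the number of workers grows, which is the intuition behind the improved stochastic term.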

by sending only 1 bit per transmitted entry, signmuon uses 32 times fewer bits than 32-bit full-precision communication. the matrix-aware update, inspired by muon, helps preserve weight matrix geometry, potentially leading to better optimization for modern architectures like transformers. the approach is simple to implement and compatible with existing distributed training frameworks. it offers a practical way to scale training across many workers without sacrificing optimizer quality.
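
the communication saving is simple arithmetic and can be checked directly. the sketch below assumes signs are stored as a {-1, +1} matrix and packed with numpy's `packbits`; the helper names `pack_signs` and `unpack_signs` are hypothetical and the wire format is an illustration, not the paper's.

```python
import numpy as np

def pack_signs(signs):
    """pack a {-1, +1} matrix into bytes, 1 bit per entry (+1 -> 1, -1 -> 0)."""
    bits = (signs.ravel() > 0).astype(np.uint8)
    return np.packbits(bits)

def unpack_signs(packed, shape):
    """inverse of pack_signs: recover the {-1, +1} matrix of the given shape."""
    n = int(np.prod(shape))
    # count=n drops the zero padding that packbits adds to fill the last byte
    bits = np.unpackbits(packed, count=n)
    return (bits.astype(np.int8) * 2 - 1).reshape(shape)

rng = np.random.default_rng(0)
shape = (1024, 1024)
signs = np.where(rng.standard_normal(shape) >= 0, 1, -1).astype(np.int8)

packed = pack_signs(signs)
fp32_bytes = np.zeros(shape, dtype=np.float32).nbytes
print(fp32_bytes, packed.nbytes, fp32_bytes // packed.nbytes)
```

for a 1024 x 1024 matrix this compares 4 bytes per entry against 1 bit per entry, so the payload ratio is 32.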

why it matters: it enables faster distributed training of large models by reducing communication costs while maintaining optimizer performance through matrix-aware updates.
