source: arxiv statistics ml: dynamics of stochastic momentum with sparse updates in high dimensions
level: research
standard momentum theory assumes gradients arrive at every parameter at a roughly constant rate. this breaks down with heavy-tailed data and modern architectures where updates are sparse. we analyze two tractable models: least squares with sparse inputs and logistic regression with a rare class. both yield exact closed-form second-moment dynamics, and we characterize their high-dimensional limits across three scaling exponents for sparsity, batch size, and momentum decay.
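the first model can be illustrated with a small simulation. this is a minimal sketch under stated assumptions, not the paper's setup: each input coordinate is nonzero with probability `p_active` and gaussian when active, labels are noiseless, and the optimizer is heavy-ball momentum in the form `m = beta * m + g`, `w -= lr * m`. the function name and all default values are hypothetical choices for illustration.

```python
import random


def sparse_lsq_momentum(dim=50, p_active=0.1, lr=0.01, beta=0.9,
                        steps=2000, seed=0):
    """Heavy-ball SGD on least squares with sparse inputs (batch size 1).

    Returns (initial_error, final_error), the squared distance between the
    iterate and the true parameter vector before and after training.
    All settings are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    w_star = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # true parameters
    w = [0.0] * dim                                     # iterate
    m = [0.0] * dim                                     # momentum buffer
    initial_error = sum((wi - si) ** 2 for wi, si in zip(w, w_star))

    for _ in range(steps):
        # Sparse input: each coordinate is active with probability p_active.
        x = [rng.gauss(0.0, 1.0) if rng.random() < p_active else 0.0
             for _ in range(dim)]
        # Residual of the noiseless label y = <x, w_star>.
        resid = sum(xi * (wi - si) for xi, wi, si in zip(x, w, w_star))
        for i in range(dim):
            g = resid * x[i]           # gradient is zero on inactive coords
            m[i] = beta * m[i] + g     # buffer decays on every step
            w[i] -= lr * m[i]          # parameter moves even when inactive

    final_error = sum((wi - si) ** 2 for wi, si in zip(w, w_star))
    return initial_error, final_error


if __name__ == "__main__":
    start, end = sparse_lsq_momentum()
    print(f"squared error: {start:.3f} -> {end:.3e}")
```

the point of the sketch is the asymmetry in the inner loop: the buffer decays on every step, but a coordinate receives a fresh gradient only when its input is active. lowering `p_active` while holding `beta` fixed widens that gap.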
the phase structure depends on the ratio of two intrinsic timescales. the momentum retention timescale measures how many active updates a coordinate receives before its momentum buffer decays. the learning timescale measures how many active updates it takes to reduce the squared error. when learning is much slower than retention, the limit matches sgd. when learning is faster than retention, the system becomes unstable. the boundary between these regimes determines whether momentum is beneficial or detrimental.
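the comparison of the two timescales can be sketched as a small helper. the definitions here are assumptions for illustration and are not taken from the paper: the retention timescale is approximated as `p_active / (1 - beta)`, the expected number of active updates a coordinate sees during the roughly `1 / (1 - beta)` steps over which the buffer decays, and the `margin` factor that stands in for "much slower" is an arbitrary choice. the learning timescale is left as an input, since it depends on the model.

```python
def retention_in_active_updates(beta, p_active):
    """Rough proxy (an assumption): active updates seen per buffer lifetime.

    The buffer decays by a factor beta each step, so it persists for about
    1 / (1 - beta) steps; a coordinate is active in a fraction p_active of them.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    if not 0.0 < p_active <= 1.0:
        raise ValueError("p_active must lie in (0, 1]")
    return p_active / (1.0 - beta)


def regime(retention, learning, margin=10.0):
    """Classify the regime from the two timescales (both in active updates).

    margin is a hypothetical threshold for "much slower", not a value
    derived in the paper.
    """
    if learning >= margin * retention:
        return "sgd-like"      # learning much slower than retention
    if learning < retention:
        return "unstable"      # learning faster than retention
    return "boundary"          # timescales comparable


if __name__ == "__main__":
    tau_retention = retention_in_active_updates(beta=0.9, p_active=0.1)
    print(tau_retention, regime(tau_retention, learning=100.0))
```

under this proxy, raising `beta` or `p_active` lengthens retention and moves a fixed learning timescale toward the unstable side, which is the direction of the tradeoff the summary describes.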
the analysis provides precise conditions for stable momentum in sparse regimes. it shows that momentum can amplify noise when updates are infrequent, leading to worse performance than sgd. the findings apply to modern training scenarios like large language models with heavy-tailed token distributions or recommender systems with rare events. practitioners can use the timescale ratio to tune momentum decay and batch size for stability.
why it matters: it gives a principled way to decide when to use momentum in sparse training, preventing instability and wasted compute.