source: arXiv Statistics ML: uniform scaling limits in AdamW-trained transformers
level: research
Researchers studied the behavior of very deep transformers trained with the AdamW optimizer. They modeled the hidden-state dynamics as an interacting particle system in which tokens are coupled through attention. Under a proper scaling of the attention heads, they proved that the joint dynamics of the hidden states and the backpropagated variables converge in L^2 to a forward-backward system of ordinary differential equations (ODEs). The convergence is uniform over initial conditions and occurs at a rate of order L^{-1} + L^{-1/3} H^{-1/2}, where L is the depth and H is the number of attention heads.
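To make the depth scaling concrete, here is a minimal sketch (not the paper's construction) of the underlying idea: a depth-L residual attention stack with a 1/L step size acts like an Euler discretization of an interacting-particle ODE, so its output gap to a very deep reference shrinks as L grows. All names and dimensions (d, n, Q, K, V, attention_field) are illustrative assumptions.

```python
# Minimal sketch: token hidden states as interacting particles coupled through
# a single attention head, with a 1/L residual step so that depth L plays the
# role of an Euler discretization of an ODE. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 8                       # hidden width, number of tokens (particles)
Q, K, V = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))

def attention_field(X):
    """Drift felt by each particle: attention over all particles."""
    scores = (X @ Q) @ (X @ K).T / np.sqrt(d)       # (n, n) similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ (X @ V)                        # interaction term

def forward(X0, L):
    """Depth-L residual stack with step size 1/L (explicit Euler)."""
    X = X0.copy()
    for _ in range(L):
        X = X + attention_field(X) / L
    return X

X0 = rng.normal(size=(n, d))
ref = forward(X0, 4096)                             # proxy for the depth limit
for L in (8, 32, 128, 512):
    err = np.linalg.norm(forward(X0, L) - ref) / np.linalg.norm(ref)
    print(f"L = {L:4d}   relative gap to deep reference ~ {err:.2e}")
```

Running this shows the gap decreasing roughly like 1/L, mirroring the depth part of the stated rate; the head-count part of the rate has no analogue in this single-head toy.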
When causal masking is not used, the limiting ODE system becomes a McKean-Vlasov ODE, in which each token's dynamics depend on the empirical measure of all tokens. The authors used the flow maps of this ODE together with concentration-of-measure techniques to bound the difference between the discrete transformer and its continuous model. These bounds hold uniformly over compact sets of initial conditions, giving a rigorous link between finite-depth transformers and their infinite-depth limits.
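For orientation, a schematic of what a McKean-Vlasov ODE looks like in this setting (not the paper's exact statement): without a causal mask every token sees the same empirical measure mu_t of hidden states, so each particle's drift depends only on its own state and on that measure.

```latex
% Schematic McKean-Vlasov dynamics for the unmasked limit (illustrative).
% x_i(t) is token i's hidden state at depth-time t; mu_t is the empirical
% measure of all hidden states; Q, K, V are attention parameter matrices.
\begin{aligned}
  \dot{x}_i(t) &= F\bigl(x_i(t), \mu_t\bigr),
  \qquad
  \mu_t = \frac{1}{n}\sum_{j=1}^{n} \delta_{x_j(t)},\\[4pt]
  F(x, \mu) &= \int
    \frac{\exp\!\bigl(\langle Q x, K y\rangle/\sqrt{d}\bigr)\, V y}
         {\int \exp\!\bigl(\langle Q x, K y'\rangle/\sqrt{d}\bigr)\,\mu(\mathrm{d}y')}
    \,\mu(\mathrm{d}y).
\end{aligned}
```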
This work provides a theoretical foundation for understanding how transformer depth and width affect training dynamics. The explicit convergence rates can guide architecture scaling and hyperparameter choices. Because the results are uniform over initial conditions, they apply broadly rather than only to specific initializations. The findings may help explain empirical observations about the stability of transformer training and could inform the design of more efficient training algorithms.
why it matters: it gives precise mathematical rates for how transformer depth and head count affect training dynamics, helping practitioners choose architectures and tune AdamW more effectively.