fully looped transformer stabilizes iterative block reuse

source: arxiv machine learning: simply stabilizing the loop via fully looped transformer

level: research

looped transformers reuse the same blocks multiple times, improving performance without adding parameters. this lets models scale by spending more compute at inference instead of growing in size. however, training becomes unstable as the loop count rises. the paper identifies two causes: gradient oscillation and residual explosion. gradient oscillation happens when the parameter updates contributed by different iterations of the shared block conflict with one another. residual explosion occurs because every loop adds another residual update to the hidden state, so its magnitude keeps growing with the number of loops.
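
a minimal sketch of block reuse and residual growth, in pure python. it is a toy and not the paper's model: the 2x2 matrix `SHARED_W`, the linear `block`, and the helper names are all hypothetical, chosen only so the effect is easy to see. the same weights are applied on every pass, so the parameter count stays fixed while the hidden-state norm climbs with the loop count.

```python
import math

# Hypothetical toy weights: one shared 2x2 matrix, reused on every loop pass.
SHARED_W = [[0.3, 0.1],
            [0.0, 0.3]]


def block(h):
    """Apply the shared weights to the hidden state (a plain matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, h)) for row in SHARED_W]


def run_looped(h, loops):
    """Reuse the same block `loops` times with a residual add.

    Returns the final hidden state and the L2 norm recorded after each pass.
    """
    norms = []
    for _ in range(loops):
        update = block(h)
        h = [a + b for a, b in zip(h, update)]  # residual connection: h <- h + block(h)
        norms.append(math.sqrt(sum(x * x for x in h)))
    return h, norms


def parameter_count():
    """Count the weights; this does not depend on how many loops are run."""
    return sum(len(row) for row in SHARED_W)


final_state, norms = run_looped([1.0, 1.0], loops=8)
print("parameters:", parameter_count())
print("norm after each loop:", [round(n, 2) for n in norms])
```

in this toy every pass adds a positive update, so the norm grows on each loop. a real block is nonlinear and normalized, so the growth there would look different, but the mechanism of repeated residual accumulation is the same one the summary describes.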

the proposed fully looped transformer adds two simple, parameter-free fixes. first, a fully looped architecture spreads inter-loop signals across all layers, not just the block boundaries. this dampens the accumulation that leads to residual explosion. second, attention injection reuses attention states from earlier loops, reducing variance in gradients. together, these changes let the model train stably with many more iterations than before.
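
the difference between boundary-only and per-layer inter-loop signals can be pictured as a wiring diagram. the sketch below covers only the first fix and only the data flow. the exact wiring is an assumption: the summary does not say how signals are routed, so the per-layer edges here (each layer receiving state from its own copy in the previous pass) are one plausible reading, and `inter_loop_edges` is a hypothetical helper, not code from the paper.

```python
def inter_loop_edges(num_layers, num_loops, fully_looped):
    """List (source, target) pairs that carry state from loop t-1 into loop t.

    Each node is a (loop_index, layer_index) tuple.
    """
    edges = []
    for t in range(1, num_loops):
        # Boundary path: the last layer of the previous pass feeds the first layer of this one.
        edges.append(((t - 1, num_layers - 1), (t, 0)))
        if fully_looped:
            # Assumed extra paths: every layer also receives state from
            # the same layer in the previous pass.
            for layer in range(num_layers):
                edges.append(((t - 1, layer), (t, layer)))
    return edges


boundary_only = inter_loop_edges(num_layers=4, num_loops=3, fully_looped=False)
per_layer = inter_loop_edges(num_layers=4, num_loops=3, fully_looped=True)
print(len(boundary_only), "inter-loop paths with boundary-only wiring")
print(len(per_layer), "inter-loop paths with the assumed per-layer wiring")
```

with boundary-only wiring, all information from the previous pass has to squeeze through a single connection per loop. under the assumed per-layer wiring it is spread over many connections, which matches the summary's description of signals distributed across all layers.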

experiments show the fully looped transformer matches or beats standard transformers on language tasks while using fewer parameters. it also scales better with loop count, offering a smooth trade-off between compute and accuracy. the method works without extra tuning or complex modifications, making it easy to adopt. this could make large models cheaper to train and deploy, since performance comes from repeated computation rather than massive parameter counts.

why it matters: it enables stable training of looped transformers, allowing models to scale performance through test-time compute instead of parameter growth, which can reduce hardware costs and improve efficiency.
