hidden scale escapes per-loop loss in looped language models

source: arxiv machine learning: dense supervision is not enough: the readout blind spot in looped language models

level: research

looped language models reuse hidden states across steps: each state is decoded for prediction and then fed back into the next computation. a key question is which state variables the per-loop cross-entropy loss actually constrains. the answer is not all of them. the loss supervises only the variables that the readout layer exposes, leaving other recurrent variables free to drift.

a clear failure mode involves hidden-state scale. when the readout is scale-invariant, as it is when the hidden state passes through rmsnorm or layernorm before decoding, the radial scale (the norm) of the hidden state is invisible to the immediate loss. meanwhile, pre-norm residual recurrence keeps updating that same scale. as a result, per-loop loss can make early exits usable without ever controlling the recurrent scale. in experiments with 44m and 129m looped transformers lacking inter-loop normalization, training with per-loop cross-entropy through rmsnorm readouts still allowed final hidden-state norms to reach thousands or tens of thousands.
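the mechanism can be illustrated with a small toy, offered as a sketch rather than the paper's setup. it assumes an rmsnorm readout with no learned gain and no epsilon, a hypothetical readout matrix `W`, and a stand-in recurrent block that simply returns its normalized input. none of these names or values come from the source. under those assumptions the cross-entropy stays constant across loops while the hidden-state norm keeps growing.

```python
import math

def rmsnorm(h):
    # Toy RMSNorm: no learned gain, no epsilon, so it is exactly scale-invariant.
    rms = math.sqrt(sum(x * x for x in h) / len(h))
    return [x / rms for x in h]

def readout(h, W):
    # Logits are computed from the normalized state only.
    z = rmsnorm(h)
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def cross_entropy(logits, target):
    # Numerically stable log-sum-exp minus the target logit.
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return lse - logits[target]

def l2_norm(h):
    return math.sqrt(sum(x * x for x in h))

def run_loops(h, W, target, n_loops):
    # Pre-norm residual recurrence: h <- h + block(rmsnorm(h)).
    # Assumption: the block is the identity on its normalized input.
    norms = [l2_norm(h)]
    losses = [cross_entropy(readout(h, W), target)]
    for _ in range(n_loops):
        update = rmsnorm(h)
        h = [a + b for a, b in zip(h, update)]
        norms.append(l2_norm(h))
        losses.append(cross_entropy(readout(h, W), target))
    return norms, losses

# Hypothetical values chosen only for illustration.
h0 = [0.5, -1.0, 2.0, 0.25]
W = [[0.1, 0.2, -0.3, 0.4],
     [-0.5, 0.1, 0.2, 0.0],
     [0.3, -0.2, 0.1, 0.1]]

norms, losses = run_loops(h0, W, target=1, n_loops=100)
print("norm at loop 0 and loop 100:", norms[0], norms[-1])
print("loss at loop 0 and loop 100:", losses[0], losses[-1])
```

in this toy the update is parallel to the state, so each loop adds a fixed amount to the norm (the square root of the dimension) while the normalized direction, and therefore the loss, does not change. a real block would also rotate the state, but the loss would still carry no gradient signal about its radial scale.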

the findings point to a structural blind spot in dense supervision for looped models. scale-visible readouts and explicit norm constraints can address the issue, but the core problem is that cross-entropy sees only what the readout shows. this has practical implications for training stable looped architectures, as uncontrolled scale can harm downstream performance and reliability even when per-step losses appear low.
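one way an explicit norm constraint could look is sketched here. this is an assumption for illustration, not the method used in the source: `norm_penalty`, `target_rms`, and `weight` are hypothetical names, and the squared log-ratio form is just one reasonable choice. the point is only that such a term, added to the per-loop loss, responds to scale where a scale-invariant readout does not.

```python
import math

def rms(h):
    # Root-mean-square of the hidden state: the quantity RMSNorm divides out.
    return math.sqrt(sum(x * x for x in h) / len(h))

def norm_penalty(h, target_rms=1.0, weight=0.01):
    # Hypothetical regularizer: squared log-ratio of the state's RMS to a target.
    # It is zero at the target scale and grows as the scale drifts either way.
    return weight * math.log(rms(h) / target_rms) ** 2

# Hypothetical states: the same direction at two very different scales.
h = [0.5, -1.0, 2.0, 0.25]
h_big = [1000.0 * x for x in h]

print("penalty at original scale:", norm_penalty(h))
print("penalty at 1000x scale:  ", norm_penalty(h_big))
```

because the penalty is computed on the raw state rather than on its normalized form, it makes the radial scale visible to the training objective at every loop.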

why it matters: understanding this blind spot helps ai practitioners design looped models that avoid runaway hidden-state scale, improving training stability and model reliability.
