Abstract

Looped language models turn hidden states into runtime state: each state is decoded for prediction and fed back into future computation. This raises a basic supervision question — which state variables does cross-entropy actually control? We show that dense per-loop cross-entropy controls the variables exposed by the readout, not every variable active in the recurrent transition.

Hidden-state scale gives a concrete failure mode. Scale-invariant readouts such as RMSNorm and LayerNorm hide radial scale from the immediate cross-entropy loss, while pre-norm residual recurrence continues to carry and update that scale. Per-loop loss can therefore make early exits usable without controlling recurrent scale. In 44M and 129M looped transformers without inter-loop normalization, per-loop cross-entropy through RMSNorm readouts still drives final hidden-state norms into the thousands or tens of thousands. Scale-visible readouts and explicit norm penalties keep norms in the tens, and scale-removing recurrence is the complementary architectural fix.

[Figure: hidden-state norm (log scale) against recurrent depth k. RMSNorm readout: ‖H_K‖ ≈ 39,207; raw readout / norm penalty: ‖H_K‖ in the tens.]

The resulting design rule is simple:

Per-loop cross-entropy trains exit usability; recurrent scale control requires either making hidden-state scale visible to a loss or removing it from the recurrent path.

Consistent with this rule, scale-controlled variants achieve lower perplexity at matched inference depth in our variable-depth benchmarks.

The blind spot

Looping changes the role of a hidden state. In a fixed-depth transformer, a late hidden state is mostly an interface to the output head. In a looped model, the same hidden state is also runtime state: it is decoded for prediction and then reused as input to future computation. So a simple question becomes unavoidable — which parts of the recurrent state does the supervised loss actually control?

Dense supervision is the natural answer: apply cross-entropy at every loop, and each intermediate state receives direct prediction supervision. This trains the prediction interfaces, so early exits become usable. But dense CE does not automatically control every variable carried by the recurrent state. It controls what the readout makes visible. If an active recurrent variable is hidden by every supervised readout, the model can be trained at every loop while that variable stays unconstrained.
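
The two loss placements can be sketched in a few lines. This is a minimal toy, not the training code: `looped_loss`, `step`, and `readout` are hypothetical names, and averaging the per-loop losses (rather than summing or weighting them) is an assumption.

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross-entropy of one target index under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def looped_loss(state, target, step, readout, num_loops, per_loop=True):
    """Apply `step` num_loops times and decode every state with `readout`.

    per_loop=True  -> mean CE over all loops (dense supervision).
    per_loop=False -> CE at the final loop only (terminal-only).
    Averaging is an assumption made for this sketch.
    """
    losses = []
    for _ in range(num_loops):
        state = step(state)  # the same state is reused as the next input
        losses.append(softmax_cross_entropy(readout(state), target))
    return sum(losses) / len(losses) if per_loop else losses[-1]

# Toy usage: the state doubles as logits, and each loop adds evidence for class 0.
toy_step = lambda h: [h[0] + 1.0, h[1]]
toy_readout = lambda h: h
dense = looped_loss([0.0, 0.0], 0, toy_step, toy_readout, num_loops=3, per_loop=True)
terminal = looped_loss([0.0, 0.0], 0, toy_step, toy_readout, num_loops=3, per_loop=False)
print(dense, terminal)
```

In both placements the loss sees the state only through `readout`, which is why the choice of readout decides which state variables receive a gradient.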

Mechanism: visible losses control visible coordinates

Write a hidden state as scale times direction, that is, its norm times a unit vector. The failure mode is a mismatch between what the recurrent state carries and what the loss can see:

  • Readout path. A scale-invariant readout (RMSNorm/LayerNorm) removes scale before the logits, so the immediate CE loss has approximately zero radial gradient. Multiplying the hidden state by a positive scalar leaves the prediction unchanged, so the loss is blind to scale.
  • Recurrent path. The pre-norm residual update carries the skip state forward, so scale stays active and the learned update can change it. Scale drifts.
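
The two paths above can be checked numerically on a toy state. This is a sketch under stated assumptions: the update `h + gain * RMSNorm(h)` is a hand-picked stand-in for a learned pre-norm block, and `readout_logits`, `prenorm_step`, `gain`, and `W` are hypothetical names.

```python
import math

def rms_norm(h, eps=1e-8):
    """RMSNorm without a learned gain: divide by the root-mean-square."""
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x / rms for x in h]

def norm(h):
    return math.sqrt(sum(x * x for x in h))

def readout_logits(h, W):
    """Scale-invariant readout: logits = W @ RMSNorm(h)."""
    z = rms_norm(h)
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def prenorm_step(h, gain):
    """Toy pre-norm residual update (an assumption, not a trained block).

    The skip path carries h unchanged, so whatever scale h has persists,
    and the added term can change it.
    """
    z = rms_norm(h)
    return [x + gain * y for x, y in zip(h, z)]

W = [[1.0, 0.0], [0.0, 1.0]]
h0 = [3.0, -4.0]

# Readout path: rescaling the state leaves the logits (almost) unchanged.
print(readout_logits(h0, W))
print(readout_logits([100.0 * x for x in h0], W))

# Recurrent path: the norm grows loop after loop while the logits stay put.
h = h0
for _ in range(50):
    h = prenorm_step(h, gain=1.0)
print(norm(h0), norm(h))
print(readout_logits(h, W))
```

In this toy the update is purely radial, so the direction never changes and the readout cannot register the growth at all. A learned update also moves the direction, but the radial part is still the part the readout hides.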

The paper makes this precise with two short lemmas — visibility (scale-invariant readouts remove the immediate radial CE signal) and activity (pre-norm residual loops carry scale at leading order) — and confirms the predicted slow-angular-motion regime directly in trained checkpoints.
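
The two statements can be sketched in a few lines of algebra. This is an informal reconstruction, not the paper's lemma statements. The symbols are assumptions introduced here: $\ell(h)$ is the CE loss through the normalized readout, $F$ is the learned block applied to the normalized state, and the normalization is taken with $\epsilon = 0$ so that invariance is exact rather than approximate.

```latex
% Visibility: a readout that is invariant to positive rescaling has no radial gradient.
\begin{align*}
\mathrm{Norm}(\alpha h) = \mathrm{Norm}(h)\ \ (\alpha > 0)
  \;&\Longrightarrow\; \ell(\alpha h) = \ell(h) \\
  \;&\Longrightarrow\; \left.\frac{d}{d\alpha}\,\ell(\alpha h)\right|_{\alpha = 1}
      = \langle \nabla_h \ell(h),\, h \rangle = 0 .
\end{align*}

% Activity: a pre-norm residual update carries the norm forward at leading order.
\begin{align*}
h_{k+1} &= h_k + f_k, \qquad f_k = F(\mathrm{Norm}(h_k)), \qquad u_k = \frac{h_k}{\|h_k\|}, \\
\|h_{k+1}\| &= \|h_k\| + \langle u_k,\, f_k \rangle
      + O\!\left(\frac{\|f_k\|^2}{\|h_k\|}\right).
\end{align*}
```

The first line says the CE gradient is orthogonal to the state, so it cannot push the norm up or down. The second says the norm is inherited from the previous loop and changes only through the radial component of the update. Because the tangential part of $f_k$ rotates the state by an angle of order $\|f_k\| / \|h_k\|$, a large norm also implies slow angular motion.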

What the experiments show

We train autoregressive looped transformers on WikiText-103 with recurrent applications of a shared decoder stack (44M: 8 layers; 129M: 12 layers), for 4 epochs over 3 seeds. The core ablation crosses loss placement with readout: {terminal-only CE, per-loop CE} × {RMSNorm readout, raw readout}.

  • The key negative result. A model trained with per-loop CE through an RMSNorm readout receives a loss at every loop, yet reaches final-loop hidden-state norms of 39,207 (44M) and 56,051 (129M). Dense supervision did not control scale.
  • The interventions work. Raw readouts keep norms in the tens; an explicit norm penalty (a one-line auxiliary loss) collapses them to 17–22, even with the same RMSNorm readout — so raw readout is not the only fix.
  • It shifts the depth–quality frontier. Scale-controlled per-loop models stay usable at early exits but keep improving with more loops. At equal throughput they sit 0.30 PPL below RMSNorm at an earlier exit and 0.42 PPL below at full depth: a compute–quality gain, not a speedup claim.
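
To illustrate why an explicit penalty works even with an RMSNorm readout, the sketch below compares radial derivatives of the two loss terms on a toy state. The penalty form (squared deviation of the state RMS from 1) is an assumption, since the text only describes it as a one-line auxiliary loss; `norm_penalty`, `radial_derivative`, `coef`, and `W` are hypothetical names.

```python
import math

def rms(h):
    return math.sqrt(sum(x * x for x in h) / len(h))

def ce_through_rmsnorm(h, W, target):
    """Cross-entropy of `target` under logits = W @ (h / rms(h))."""
    r = rms(h)
    z = [x / r for x in h]
    logits = [sum(w * x for w, x in zip(row, z)) for row in W]
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits)) - logits[target]

def norm_penalty(h, target_rms=1.0):
    """Hypothetical one-line auxiliary loss: (rms(h) - target_rms)^2."""
    return (rms(h) - target_rms) ** 2

def radial_derivative(f, h, delta=1e-4):
    """Central difference of f along the ray through h, i.e. d/da f(a*h) at a=1."""
    up = [x * (1.0 + delta) for x in h]
    down = [x * (1.0 - delta) for x in h]
    return (f(up) - f(down)) / (2.0 * delta)

W = [[0.5, -1.0, 0.2], [1.0, 0.3, -0.7]]
h = [3.0, -4.0, 12.0]

ce_radial = radial_derivative(lambda v: ce_through_rmsnorm(v, W, 0), h)
pen_radial = radial_derivative(norm_penalty, h)
print(ce_radial)   # numerically zero: CE cannot see scale through RMSNorm
print(pen_radial)  # clearly nonzero: the penalty makes scale visible

coef = 0.01  # hypothetical penalty weight
total = ce_through_rmsnorm(h, W, 0) + coef * norm_penalty(h)
print(total)
```

The readout is unchanged here; only the added term supplies a radial signal, which matches the observation that the penalty controls norms without switching to a raw readout.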

Takeaway

Dense supervision trains exits. Controlling an active recurrent variable is a separate requirement: make it visible to a loss, or remove it from the loop. Raw readouts, explicit penalties, and scale-removing recurrence are different engineering points on that same requirement.
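
The "remove it from the loop" option can be sketched as normalizing the state between loops. This is an illustrative toy, assuming RMS normalization as the inter-loop step and the same hand-picked update `h + gain * RMSNorm(h)` as a stand-in for a learned block; `normalized_step` and `gain` are hypothetical names.

```python
import math

def rms_norm(h, eps=1e-8):
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x / rms for x in h]

def norm(h):
    return math.sqrt(sum(x * x for x in h))

def prenorm_step(h, gain):
    """Toy pre-norm residual update; the skip path carries scale forward."""
    z = rms_norm(h)
    return [x + gain * y for x, y in zip(h, z)]

def normalized_step(h, gain):
    """Same update followed by inter-loop normalization (assumed form).

    The state handed to the next loop always has RMS close to 1, so there
    is no scale left for the recurrence to accumulate.
    """
    return rms_norm(prenorm_step(h, gain))

carried = [3.0, -4.0]
removed = [3.0, -4.0]
for _ in range(50):
    carried = prenorm_step(carried, gain=1.0)
    removed = normalized_step(removed, gain=1.0)

print(norm(carried))  # keeps growing with depth
print(norm(removed))  # pinned near sqrt(dimension)
```

Compared with a penalty, this changes the architecture rather than the loss: scale is not made visible, it simply stops being a recurrent variable.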