Delta Attention Residuals: It's the Routing, Not the Sources

Cheng Luo, Zefan Cai, Junjie Hu — May 2026

TL;DR

This is a follow-up to our Open Attention Residuals work. We built Delta Attention Residuals, which beats both standard residuals and Attention Residuals from 220M to 7.6B (1.7–8.2% lower validation PPL). But after some good pushback online, we want to be precise about why it works — because our first framing was a bit too convenient.

The corrected story: The win is not "route over deltas instead of cumulative states." Attention Residuals already route over per-sublayer outputs (deltas). The actual lever is the routing formulation: additive routing ($\mathbf{h} = \tilde{\mathbf{h}} + \sum_i \alpha_i \mathbf{v}_i$) instead of replacement routing ($\mathbf{h} = \sum_i \alpha_i \mathbf{v}_i$). That single change is what preserves the residual stream, gives a safe zero-init, and lets you convert pretrained checkpoints by fine-tuning.

Where We Started (and what was sloppy)

Our initial pitch was: standard Attention Residuals route over cumulative hidden states $\mathbf{h}_i = \mathbf{h}_0 + \sum_{j<i}\mathbf{v}_j$, which are highly redundant, so the softmax routing collapses toward uniform in deep layers (max weight ${\sim}0.2$). Delta routing fixes this by attending over deltas $\mathbf{v}_i = \mathbf{h}_{i+1} - \mathbf{h}_i$, keeping routing sharp (max weight ${\sim}0.6$).

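That story is half right, and the half that's wrong matters. Read carefully, the original Attention Residuals paper largely routes over per-sublayer outputs (Full AttnRes) or block-level deltas (Block AttnRes), not over raw cumulative states.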

So "delta vs cumulative" is not the clean dividing line we implied. Attention Residuals are already, to a large extent, attending over deltas. Thanks to @nrol_ling for pushing on this — it forced us to find the real mechanism.

The Two Axes That Actually Matter

It's cleaner to think about two orthogonal design choices:

| Axis | Options | What it controls |
|---|---|---|
| Source | cumulative state $\mathbf{h}_i$ vs delta $\mathbf{v}_i$ / block-delta $\Delta_b$ | How distinctive the routing candidates are |
| Routing | replacement $\mathbf{h}=\sum\alpha_i\mathbf{s}_i$ vs additive $\mathbf{h}=\tilde{\mathbf{h}}+\sum\alpha_i\mathbf{v}_i$ | Whether the residual stream survives |

Mapping the methods onto this grid makes the picture obvious:

| Method | Source | Routing | Reset? |
|---|---|---|---|
| Block AttnRes (Kimi) | block delta | replacement | yes |
| Full AttnRes (Kimi) | sublayer delta | replacement | every layer |
| Delta Block (ours) | block delta | additive | no |
| Delta AttnRes (ours) | sublayer delta | additive | no |

Read the rows: our methods share the source type with Kimi's. The column that flips is routing. Delta Block vs Block AttnRes differ in exactly one thing — additive vs replacement.

Why Additive Routing Is the Lever

Replacement routing throws the residual stream away and rebuilds the hidden state from a convex combination of sources:

# Replacement (Attention Residuals): the stream is overwritten
h = sum(alpha_i * v_i)            # softmax weights sum to 1

# Additive (Delta Attention Residuals): the stream is preserved
h = h_tilde + sum(alpha_i * v_i)  # routing augments, never replaces

Three consequences follow, and all three are about the residual stream, not the sources:

  1. Residual preservation. The gradient highway $\tilde{\mathbf{h}}$ is always there. Replacement breaks it whenever the softmax doesn't happen to reconstruct it.
  2. Safe initialization. With zero-init queries the softmax is uniform, so $\sum\alpha_i\mathbf{v}_i$ is just a small bounded perturbation added to $\tilde{\mathbf{h}}$ — the layer is the identity at step 0. A replacement layer at uniform init returns the mean of the sources, which is not the residual sum, so it cannot reduce to a vanilla transformer.
  3. No information loss at boundaries. Because we never reset, every sublayer's contribution stays individually reachable.

The safe-init property is the practical payoff: it's what lets you take an off-the-shelf pretrained checkpoint and fine-tune it into a Delta Attention Residual model with zero disruption at step 0.
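
The init behaviour described above can be checked on a toy example. This is a minimal sketch, not the authors' implementation: the vectors are made up, the deltas double as routing keys, and `h_tilde` is assumed to be the plain residual stream (`h0` plus every delta so far). There is no extra gate or scaling here, so the additive update comes out bounded rather than exactly zero; the real code may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

# Hypothetical toy values: h0 is the embedding, deltas are sublayer outputs.
h0 = [1.0, -2.0, 0.5]
deltas = [[0.3, 0.1, -0.2], [-0.1, 0.4, 0.0], [0.2, -0.3, 0.1]]

# Assumption: h_tilde is the ordinary residual stream.
h_tilde = [h0[k] + sum(v[k] for v in deltas) for k in range(len(h0))]

# Zero-init query: every logit is 0, so the softmax is uniform.
query = [0.0, 0.0, 0.0]
alpha = softmax([dot(query, v) for v in deltas])

routed = weighted_sum(alpha, deltas)               # mean of the deltas at init
h_replace = routed                                 # replacement: stream overwritten
h_add = [a + b for a, b in zip(h_tilde, routed)]   # additive: stream preserved

# Distance of each variant from the plain residual stream.
drift_add = norm([a - b for a, b in zip(h_add, h_tilde)])
drift_replace = norm([a - b for a, b in zip(h_replace, h_tilde)])
max_delta_norm = max(norm(v) for v in deltas)

print(f"uniform weights: {alpha}")
print(f"additive drift:    {drift_add:.3f} (bounded by {max_delta_norm:.3f})")
print(f"replacement drift: {drift_replace:.3f}")
```

In this toy setting the additive output moves away from `h_tilde` by at most the norm of the largest delta, while the replacement output lands on the mean of the deltas and loses `h0` entirely.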

The Evidence: Replacement Degrades at Scale

From-scratch training on FineWeb-Edu, 10K steps, matched architecture/data/hyperparameters per scale. Validation perplexity (lower is better):

| Scale | Baseline | AttnRes (replace) | Full AttnRes (replace) | Delta Block (add) | Delta AttnRes (add) |
|---|---|---|---|---|---|
| 220M | 38.71 | 37.39 | 37.30 | 37.08 | 36.83 |
| 533M | 32.00 | 31.75 | 31.68 | 31.16 | 31.05 |
| 1044M | 29.70 | 31.76 (+6.9%) | 33.36 (+12.3%) | 29.19 | 29.13 |
| 7.57B | 17.43 | 18.58 (+6.6%) | | 16.00 (−8.2%) | |

At small scale, replacement routing is fine and even helps. But as depth grows, replacement falls behind the plain-residual baseline: +6.9% PPL at 1B and +6.6% at 7.6B. The most aggressive replacement variant (Full AttnRes, reset every layer) is the worst at 1B (+12.3%). Additive routing turns that same setup into a consistent win at every scale, peaking at −8.2% PPL at 7.6B.

This is the cleanest argument for the two-axis view: holding the source fixed (block deltas) and flipping only the routing (replacement → additive) converts a method that degrades into one that improves.
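
The percentages quoted in this section are relative perplexity changes against the baseline at the same scale. As a quick sanity check, the sketch below recomputes them from the reported values; the helper name `relative_change` and the row labels are ours, introduced only for illustration.

```python
def relative_change(ppl, baseline):
    """Percent change in perplexity relative to the baseline (negative is better)."""
    return 100.0 * (ppl - baseline) / baseline

# (method PPL, baseline PPL) pairs copied from the results reported above.
rows = {
    "replacement @ 1044M": (31.76, 29.70),
    "replacement, reset every layer @ 1044M": (33.36, 29.70),
    "replacement @ 7.57B": (18.58, 17.43),
    "additive @ 7.57B": (16.00, 17.43),
}

changes = {name: round(relative_change(p, b), 1) for name, (p, b) in rows.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

The recomputed values agree with the +6.9%, +12.3%, +6.6%, and −8.2% figures in the text.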

So What About Routing Collapse?

Routing collapse is still real — we do observe the max softmax weight fall to ${\sim}0.2$ in deep layers for the replacement baselines, while our additive delta routing holds ${\sim}0.6$ (1.8× higher average max weight, 0.62 vs 0.35). But we now read it as a symptom rather than the root cause. Redundancy among routing candidates lowers the contrast the softmax can express; resetting and replacing the stream compounds the damage at depth. Additive routing sidesteps both, which is why the routing stays sharp and the loss stays low.
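
To see why redundancy alone flattens the softmax, here is a small sketch under simplifying assumptions. The scores are random numbers standing in for query-key dot products, and "redundant" candidates are modelled as sharing one large common component with only 10% of the per-candidate variation left. None of these numbers come from our experiments.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_weight(logits):
    return max(softmax(logits))

random.seed(0)
num_candidates = 8

# Hypothetical per-candidate scores for distinctive routing candidates.
scores = [random.gauss(0.0, 1.0) for _ in range(num_candidates)]

# Redundant candidates: a shared offset plus a shrunken copy of the same scores.
# The shared offset cancels in the softmax; only the shrunken spread remains.
shared_component = 3.0
redundant_logits = [shared_component + 0.1 * s for s in scores]
distinct_logits = scores

sharp = max_weight(distinct_logits)
flat = max_weight(redundant_logits)
uniform = 1.0 / num_candidates

print(f"uniform floor:        {uniform:.3f}")
print(f"redundant candidates: {flat:.3f}")
print(f"distinct candidates:  {sharp:.3f}")
```

Shrinking the spread of the logits acts like raising the softmax temperature, so the maximum weight for the redundant candidates sits closer to the uniform floor than the maximum weight for the distinct ones.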

Cost

Delta Block is the practical default. At 7.57B it adds 589.8K routing parameters (0.008%) and ~3% memory. It's also faster and lighter than Attention Residuals (14.0k vs 12.5k tok/s, 42.7 vs 44.0 GB) precisely because additive routing avoids the costly hidden-state replacement and reset operations.
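
For readers who want the overheads as ratios, the arithmetic behind these figures can be sketched as follows. The inputs are the numbers quoted above; the variable names are ours.

```python
# Parameter overhead of the routing weights at 7.57B.
routing_params = 589.8e3
model_params = 7.57e9
param_overhead_pct = 100.0 * routing_params / model_params

# Throughput: Delta Block vs Attention Residuals, in tokens per second.
delta_tok_s, attnres_tok_s = 14.0e3, 12.5e3
speedup_pct = 100.0 * (delta_tok_s - attnres_tok_s) / attnres_tok_s

# Memory: Delta Block vs Attention Residuals, in GB.
delta_gb, attnres_gb = 42.7, 44.0
memory_saving_pct = 100.0 * (attnres_gb - delta_gb) / attnres_gb

print(f"routing parameter overhead: {param_overhead_pct:.3f}%")
print(f"throughput gain over AttnRes: {speedup_pct:.1f}%")
print(f"memory saving over AttnRes: {memory_saving_pct:.1f}%")
```

That works out to roughly 0.008% extra parameters, about 12% higher throughput, and about 3% less memory than Attention Residuals.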

Lessons (Updated)

  1. Name the right variable. Our gain comes from additive vs replacement routing, not from inventing delta sources — Attention Residuals already route over deltas.
  2. Preserve the residual stream. Replacement routing discards the gradient highway and gets worse with depth; additive routing keeps it and scales.
  3. Safe zero-init is the unlock for fine-tuning. Identity-at-init is only possible because routing adds to the stream rather than replacing it.
  4. Ablate the axis you claim. To isolate the lever, fix the source (block deltas) and flip only the routing. That's the comparison that actually supports the story.
  5. Public pushback makes the paper better. The sharpest framing of this work came out of a Twitter thread, not the first draft.

Quick Start

# Train from scratch (Delta Block, recommended)
torchrun --nproc_per_node=8 train_scratch.py --mode delta_block --num_blocks 4

# Same-source ablation: delta sources, but replacement routing
torchrun --nproc_per_node=8 train_scratch.py --mode delta_replace_block --num_blocks 4

# Convert a pretrained checkpoint via fine-tuning (additive, zero-init)
torchrun --nproc_per_node=8 train_finetune.py --mode delta_block

# Evaluate downstream
python eval_downstream.py --model_path output/.../final --mode delta_block --tasks paper

Citation

@article{luo2026delta,
  title={Delta Attention Residuals},
  author={Cheng Luo and Zefan Cai and Junjie Hu},
  url={https://github.com/wdlctc/delta-attention-residuals-code},
  year={2026}
}

@article{kimi2025attention,
  title={Attention Residuals},
  author={Kimi Team},
  journal={arXiv preprint arXiv:2603.15031},
  year={2026}
}