This is a follow-up to our Open Attention Residuals work. We built Delta Attention Residuals, which beats both standard residuals and Attention Residuals at every scale from 220M to 7.57B parameters (1.7–8.2% lower validation PPL than the standard-residual baseline). But after some good pushback online, we want to be precise about why it works, because our first framing was a bit too convenient.
Our initial pitch was: standard Attention Residuals route over cumulative hidden states $\mathbf{h}_i = \mathbf{h}_0 + \sum_{j\le i}\mathbf{v}_j$, which are highly redundant, so the softmax routing collapses toward uniform in deep layers (max weight ${\sim}0.2$). Delta routing fixes this by attending over deltas $\mathbf{v}_i = \mathbf{h}_i - \mathbf{h}_{i-1}$, keeping routing sharp (max weight ${\sim}0.6$).
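That story is half right, and the half that's wrong matters. If you read the original Attention Residuals paper carefully, its routing sources are already deltas: block deltas in Block AttnRes and sublayer deltas in Full AttnRes. The source alone therefore cannot explain the gap.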
It's cleaner to think about two orthogonal design choices:
| Axis | Options | What it controls |
|---|---|---|
| Source | cumulative state $\mathbf{h}_i$ vs delta $\mathbf{v}_i$ / block-delta $\Delta_b$ | How distinctive the routing candidates are |
| Routing | replacement $\mathbf{h}=\sum\alpha_i\mathbf{s}_i$ vs additive $\mathbf{h}=\tilde{\mathbf{h}}+\sum\alpha_i\mathbf{v}_i$ | Whether the residual stream survives |
Mapping the methods onto this grid makes the picture obvious:
| Method | Source | Routing | Reset? |
|---|---|---|---|
| Block AttnRes (Kimi) | block delta | replacement | yes |
| Full AttnRes (Kimi) | sublayer delta | replacement | every layer |
| Delta Block (ours) | block delta | additive | no |
| Delta AttnRes (ours) | sublayer delta | additive | no |
Read the rows: our methods share the source type with Kimi's. The column that flips is routing. Delta Block and Block AttnRes differ in exactly one thing: additive vs replacement routing.
Replacement routing throws the residual stream away and rebuilds the hidden state from a convex combination of sources:
```python
# Replacement (Attention Residuals): the stream is overwritten
h = sum(alpha_i * v_i)            # softmax weights sum to 1

# Additive (Delta Attention Residuals): the stream is preserved
h = h_tilde + sum(alpha_i * v_i)  # routing augments, never replaces
```
Three consequences follow, and all three are about the residual stream, not the sources:
The safe-init property is the practical payoff: it's what lets you take an off-the-shelf pretrained checkpoint and fine-tune it into a Delta Attention Residual model with zero disruption at step 0.
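A minimal sketch of why additive routing leaves a pretrained model untouched at step 0. It assumes the routed term is scaled by a zero-initialized scalar `gate`; that parameter, the function names, and the toy vectors are all hypothetical, and the repository may realize zero-init differently (for example in a projection weight).

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def replacement_route(scores, sources):
    # Replacement: the hidden state is rebuilt from the sources alone.
    return weighted_sum(softmax(scores), sources)

def additive_route(h_tilde, scores, sources, gate):
    # Additive: the stream h_tilde is kept; routing adds a gated correction.
    # `gate` is a hypothetical zero-initialized parameter (assumption).
    routed = weighted_sum(softmax(scores), sources)
    return [h + gate * r for h, r in zip(h_tilde, routed)]

h_tilde = [1.0, -2.0, 0.5]                       # stream from the pretrained model
sources = [[0.3, 0.1, -0.4], [0.2, -0.5, 0.9]]   # toy deltas v_i
scores = [0.0, 0.0]                              # untrained routing logits

step0_additive = additive_route(h_tilde, scores, sources, gate=0.0)
step0_replace = replacement_route(scores, sources)

print(step0_additive == h_tilde)  # additive path reproduces the pretrained stream
print(step0_replace == h_tilde)   # replacement path does not
```

With the gate at zero, the additive output equals `h_tilde` exactly, so fine-tuning starts from the pretrained function. The replacement output is a convex combination of the sources and has no term that carries `h_tilde` through.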
From-scratch training on FineWeb-Edu, 10K steps, matched architecture/data/hyperparameters per scale. Validation perplexity (lower is better):
| Scale | Baseline | Block AttnRes (replace) | Full AttnRes (replace) | Delta Block (add) | Delta AttnRes (add) |
|---|---|---|---|---|---|
| 220M | 38.71 | 37.39 | 37.30 | 37.08 | 36.83 |
| 533M | 32.00 | 31.75 | 31.68 | 31.16 | 31.05 |
| 1044M | 29.70 | 31.76 (+6.9%) | 33.36 (+12.3%) | 29.19 | 29.13 |
| 7.57B | 17.43 | 18.58 (+6.6%) | — | 16.00 (−8.2%) | — |
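This is the cleanest argument for the two-axis view: holding the source fixed (block deltas) and flipping only the routing (replacement → additive) converts a method that degrades relative to the baseline at 1044M and 7.57B (+6.9%, +6.6%) into one that improves on it (−1.7%, −8.2%). At 220M and 533M both routings beat the baseline, and additive routing still gives the lower perplexity.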
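The relative changes can be recomputed directly from the table. The sketch below uses only the baseline, block-level replacement, and Delta Block columns; the helper name `relative_change` is ours for illustration.

```python
def relative_change(ppl, baseline):
    """Percent change in perplexity relative to the baseline (negative is better)."""
    return 100.0 * (ppl - baseline) / baseline

# (baseline, block-level replacement, Delta Block) validation PPL from the table above
results = {
    "220M": (38.71, 37.39, 37.08),
    "533M": (32.00, 31.75, 31.16),
    "1044M": (29.70, 31.76, 29.19),
    "7.57B": (17.43, 18.58, 16.00),
}

for scale, (base, replace, additive) in results.items():
    print(
        f"{scale}: replacement {relative_change(replace, base):+.1f}%, "
        f"additive {relative_change(additive, base):+.1f}%"
    )
```

The sign of the replacement column flips from negative to positive between 533M and 1044M, while the additive column stays negative at every scale.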
Routing collapse is still real: in the replacement baselines we observe the max softmax weight fall to ${\sim}0.2$ in deep layers, while our additive delta routing holds ${\sim}0.6$ (average max weight 0.62 vs 0.35, about 1.8× higher). But we now read it as a symptom rather than the root cause. Redundancy among routing candidates lowers the contrast the softmax can express; resetting and replacing the stream compounds the damage at depth. Additive routing sidesteps both, which is why the routing stays sharp and the loss stays low.
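A toy illustration of the redundancy effect, not a reproduction of our measurements. It assumes i.i.d. Gaussian deltas (real deltas are correlated) and a made-up scorer in which the query equals the last candidate and scores are temperature-scaled cosine similarities; neither is the scoring used in the models above.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pairwise_cosine(vectors):
    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def max_routing_weight(candidates, temperature=4.0):
    # Toy scorer (assumption): query = last candidate, score = temperature * cosine.
    query = candidates[-1]
    return max(softmax([temperature * cosine(query, c) for c in candidates]))

rng = random.Random(0)
dim, num_layers = 256, 8
h0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
deltas = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_layers)]

# Cumulative states h_i = h_0 + sum_{j<=i} v_j share most of their content.
cumulative, state = [], h0
for v in deltas:
    state = [s + x for s, x in zip(state, v)]
    cumulative.append(state)

mean_cos_cum = mean_pairwise_cosine(cumulative)
mean_cos_delta = mean_pairwise_cosine(deltas)
max_w_cum = max_routing_weight(cumulative)
max_w_delta = max_routing_weight(deltas)

print("mean pairwise cosine:", round(mean_cos_cum, 3), "(cumulative) vs", round(mean_cos_delta, 3), "(deltas)")
print("max softmax weight:  ", round(max_w_cum, 3), "(cumulative) vs", round(max_w_delta, 3), "(deltas)")
```

Because every cumulative state contains all earlier deltas, the candidates point in similar directions, their scores bunch together, and the softmax cannot put much mass on any one of them. Nearly orthogonal candidates leave room for a sharp peak.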
Delta Block is the practical default. At 7.57B it adds 589.8K routing parameters (0.008%) and ~3% memory. It's also faster and lighter than Attention Residuals (14.0k vs 12.5k tok/s, 42.7 vs 44.0 GB) precisely because additive routing avoids the costly hidden-state replacement and reset operations.
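The overhead figures can be checked with a few lines of arithmetic. This sketch assumes the 0.008% is taken relative to the 7.57B total parameter count and that throughput and memory are compared directly between the two methods; the variable names are ours.

```python
# Parameter overhead of the routing module at 7.57B (assumed denominator: total params)
routing_params = 589.8e3
total_params = 7.57e9
param_overhead_pct = 100.0 * routing_params / total_params

# Throughput: Delta Block vs Attention Residuals, tokens per second
delta_block_tps, attnres_tps = 14.0e3, 12.5e3
speedup = delta_block_tps / attnres_tps

# Memory: Delta Block vs Attention Residuals, in GB
delta_block_gb, attnres_gb = 42.7, 44.0
memory_saving_pct = 100.0 * (attnres_gb - delta_block_gb) / attnres_gb

print(f"routing parameter overhead: {param_overhead_pct:.3f}%")
print(f"throughput ratio: {speedup:.2f}x")
print(f"memory saved vs Attention Residuals: {memory_saving_pct:.1f}%")
```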
```bash
# Train from scratch (Delta Block, recommended)
torchrun --nproc_per_node=8 train_scratch.py --mode delta_block --num_blocks 4

# Same-source ablation: delta sources, but replacement routing
torchrun --nproc_per_node=8 train_scratch.py --mode delta_replace_block --num_blocks 4

# Convert a pretrained checkpoint via fine-tuning (additive, zero-init)
torchrun --nproc_per_node=8 train_finetune.py --mode delta_block

# Evaluate downstream
python eval_downstream.py --model_path output/.../final --mode delta_block --tasks paper
```
```bibtex
@article{luo2026delta,
  title={Delta Attention Residuals},
  author={Cheng Luo and Zefan Cai and Junjie Hu},
  url={https://github.com/wdlctc/delta-attention-residuals-code},
  year={2026}
}

@article{kimi2025attention,
  title={Attention Residuals},
  author={Kimi Team},
  journal={arXiv preprint arXiv:2603.15031},
  year={2026}
}
```