Open Attention Residuals: Replacing Additive Residuals with Learned Cross-Layer Attention

Cheng Luo, Zefan Cai — March 2025

TL;DR

We provide an open-source implementation of Attention Residuals (Kimi Team, 2025) and systematically evaluate them on Qwen3-architecture models. Standard transformers use simple additive residual connections; Attention Residuals replace these with a learned softmax attention over previous layer representations, allowing each layer to selectively retrieve information from any earlier layer.

Key result: trained from scratch, Block Attention Residuals cut training loss by 0.058 on a 0.6B-parameter model and WikiText-2 perplexity by 7.7% on a 100M model, while adding only 0.03% extra parameters.

What Are Attention Residuals?

In a standard transformer, the residual connection at each layer is a simple addition:

h = h + Attention(Norm(h))
h = h + MLP(Norm(h))

Every layer can only see the cumulative sum of all previous layers' outputs. A deep layer cannot selectively access the representation from, say, layer 3 without also including the modifications from layers 4 through N-1.

Attention Residuals replace this additive shortcut with a learned depth-wise attention mechanism. Layers are grouped into blocks, and before each sublayer, the model attends over all previous block representations to decide what information to retrieve:

import torch
import torch.nn.functional as F

def block_attn_res(blocks, partial_block, proj, norm):
    # blocks: list of (B, T, D) block summaries; partial_block: (B, T, D)
    V = torch.stack(blocks + [partial_block])          # (N+1, B, T, D) values
    K = norm(V)                                        # RMSNorm the keys
    query = proj.weight.view(-1)                       # learned query, shape (D,)
    logits = torch.einsum("d,nbtd->nbt", query, K)     # score each source per token
    weights = F.softmax(logits, dim=0)                 # (N+1, B, T), softmax over sources
    return torch.einsum("nbt,nbtd->btd", weights, V)   # weighted mix of sources

This allows each layer to selectively attend to specific earlier blocks — similar to how standard attention operates over token positions, but applied across the network's depth.
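As a quick shape check, the same computation can be run inline on dummy tensors. All dimensions below are arbitrary, and a manual RMS normalization (without a learned scale) stands in for the model's RMSNorm:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, B, T, D = 4, 2, 16, 512                 # sources, batch, tokens, width (arbitrary)
blocks = [torch.randn(B, T, D) for _ in range(N)]
partial = torch.randn(B, T, D)
proj = nn.Linear(D, 1, bias=False)         # its single weight row is the learned query

V = torch.stack(blocks + [partial])                          # (N+1, B, T, D)
K = V * torch.rsqrt(V.pow(2).mean(-1, keepdim=True) + 1e-6)  # RMSNorm, no learned scale
query = proj.weight.view(-1)                                 # (D,)
logits = torch.einsum("d,nbtd->nbt", query, K)
weights = F.softmax(logits, dim=0)                           # sums to 1 over the N+1 sources
out = torch.einsum("nbt,nbtd->btd", weights, V)
print(out.shape)                                             # torch.Size([2, 16, 512])
```

Note that the softmax runs over the source axis (dim 0), not over tokens: each token position independently mixes the N+1 depth sources.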

Two Modes: Block vs Full

| Mode          | Sources              | Description                                                    |
| ------------- | -------------------- | -------------------------------------------------------------- |
| Block AttnRes | N blocks (~4-8)      | Layers grouped into blocks; attend over block-level summaries  |
| Full AttnRes  | All sublayer outputs | Every sublayer output is a source; finest-grained routing      |

Experiment 1: Training from Scratch

We train a ~100M model (d=512, L=12) from scratch on FineWeb-Edu for 20k steps with identical hyperparameters across three variants: standard residual baseline, Block AttnRes (4 blocks), and Full AttnRes (per-sublayer).

[Figure: Training loss curves, baseline vs Block vs Full AttnRes]

| Model         | Train Loss | WikiText-2 PPL | LAMBADA Acc | HellaSwag Acc |
| ------------- | ---------- | -------------- | ----------- | ------------- |
| Baseline      | 3.523      | 76.76          | 0.076       | 0.315         |
| Block AttnRes | 3.489      | 70.82          | 0.084       | 0.340         |
| Full AttnRes  | 3.502      | 72.70          | 0.102       | 0.305         |

Block AttnRes reduces WikiText-2 perplexity by 7.7% (76.76 → 70.82) with only 0.03% additional parameters. Full AttnRes also improves but is less effective at this scale.

We also ran this experiment at 0.6B scale (d=1024, L=28, same architecture as Qwen3-0.6B):

| Model (0.6B, 20k steps)  | Train Loss | Improvement |
| ------------------------ | ---------- | ----------- |
| Baseline                 | 3.303      |             |
| Block AttnRes (8 blocks) | 3.245      | -0.058      |
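The headline numbers follow directly from the tables; a quick arithmetic check:

```python
baseline_ppl, block_ppl = 76.76, 70.82       # 100M-scale WikiText-2 perplexities
baseline_loss, attnres_loss = 3.303, 3.245   # 0.6B-scale training losses
print(f"{(baseline_ppl - block_ppl) / baseline_ppl:.1%}")  # 7.7%
print(f"{attnres_loss - baseline_loss:+.3f}")              # -0.058
```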

Why Block Beats Full

Block AttnRes consistently outperforms Full AttnRes. With 4 blocks of 3 layers each, every block summary accumulates several layers of computation, so the sources the softmax routes over are meaningfully distinct. Full AttnRes instead exposes many individual sublayer outputs that differ by only one small residual update, so the signal-to-noise ratio of the routing is worse.
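A toy simulation illustrates the point: if each sublayer adds a small update to the residual stream, adjacent sublayer outputs are nearly parallel, while block summaries taken every few layers drift further apart. The 0.1 update scale and block size of 6 below are arbitrary choices for the sketch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D = 512
h = torch.randn(D)
states = [h.clone()]
for _ in range(24):                    # 24 sublayer updates
    h = h + 0.1 * torch.randn(D)       # small residual update (toy stand-in)
    states.append(h.clone())

def mean_adjacent_cos(xs):
    # average cosine similarity between consecutive sources
    sims = [F.cosine_similarity(a, b, dim=0) for a, b in zip(xs, xs[1:])]
    return torch.stack(sims).mean().item()

full_sim = mean_adjacent_cos(states)        # every sublayer output is a source
block_sim = mean_adjacent_cos(states[::6])  # one summary per 6-sublayer block
print(full_sim > block_sim)                 # True: full-mode sources are more alike
```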

Visualizing Layer Dependencies

We visualize the learned softmax weights to see how layers route information across depth. Each row is a sublayer (Attn/MLP alternating), each column is a source (earlier block or sublayer output). Brighter = higher attention weight.
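The heatmaps are built from the per-sublayer softmax weights, averaged over batch and token positions. A sketch on dummy weights (in practice each sublayer's weight tensor would be collected during a forward pass):

```python
import torch

def dependency_matrix(weights_per_sublayer):
    """Average depth-attention weights over batch and tokens into a
    (num_sublayers, max_sources) matrix, zero-padded for plotting."""
    rows = [w.mean(dim=(1, 2)) for w in weights_per_sublayer]   # (n_sources,) each
    max_src = max(r.numel() for r in rows)
    M = torch.zeros(len(rows), max_src)
    for i, r in enumerate(rows):
        M[i, : r.numel()] = r       # shorter rows stay zero-padded on the right
    return M

# dummy weights: sublayer i attends over i+1 sources, as in Full AttnRes
dummy = [torch.softmax(torch.randn(i + 1, 2, 16), dim=0) for i in range(6)]
M = dependency_matrix(dummy)
print(M.shape)   # torch.Size([6, 6])
```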

Block AttnRes (4 blocks, trained from scratch)

[Figure: Block AttnRes layer dependencies]

Rich cross-block attention patterns: layers selectively attend to specific earlier blocks, not just the most recent one. The embedding (Block 0) receives significant attention from many layers.

Full AttnRes (per-sublayer, trained from scratch)

[Figure: Full AttnRes layer dependencies]

The source set grows by one per sublayer, producing the smooth triangular pattern. The model learns genuine cross-layer connections, but the attention is more diffuse across the many similar sources.

Experiment 2: Fine-tuning Pretrained Models

We also applied AttnRes to pretrained Qwen3 models (0.6B and 1.7B) via continued pretraining on FineWeb-Edu for 10k steps. This is a harder setting because the pretrained weights are already optimized for standard residual connections.

| Model      | Method                           | Final Loss | vs Baseline |
| ---------- | -------------------------------- | ---------- | ----------- |
| Qwen3-0.6B | Baseline (continued pretraining) | 2.571      |             |
| Qwen3-0.6B | Block AttnRes (bias=0 frozen)    | 2.550      | -0.021      |
| Qwen3-0.6B | Full AttnRes (bias=0 frozen)     | 2.549      | -0.022      |
| Qwen3-1.7B | Baseline                         | 2.343      |             |
| Qwen3-1.7B | Block AttnRes (bias=3 learnable) | 2.340      | -0.003      |

Fine-tuning gains are much smaller (about -0.02, versus -0.058 from scratch). The pretrained model's representations are already committed to the standard residual flow: when we visualize the learned weights, the attention collapses onto the diagonal (the most recent source), meaning the model barely uses cross-layer routing.

[Figure: Fine-tuning loss curves comparison]
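The collapse can be quantified as the share of softmax mass placed on the most recent source. A sketch with recency-biased dummy weights (real weights would come from a forward pass through the fine-tuned model):

```python
import torch

def recency_mass(weights_per_sublayer):
    # mean attention weight on the most recent source (the heatmap "diagonal")
    shares = [w[-1].mean() for w in weights_per_sublayer]   # last source = newest
    return torch.stack(shares).mean().item()

# dummy weights with a strong recency bias, mimicking the fine-tuned model
biased = [torch.softmax(3.0 * torch.arange(n, dtype=torch.float32)
                        .view(n, 1, 1).expand(n, 2, 8), dim=0)
          for n in range(2, 8)]
print(recency_mass(biased) > 0.9)   # True: almost all mass on the newest source
```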

What We Tried to Improve Fine-tuning

| Approach                               | Result                                                                      |
| -------------------------------------- | --------------------------------------------------------------------------- |
| Learnable recency bias (init=3, 5, 10) | Model increases the bias during training, escaping back to standard residuals |
| Sigmoid scalar gate (init=-10)         | Gate stuck at 0 due to sigmoid saturation                                    |
| Frozen base + AttnRes only             | Loss 2.75; AttnRes alone cannot compensate for disrupted inputs              |
| LoRA + AttnRes                         | Loss 2.57; LoRA too constrained for deep co-adaptation                       |
| Knowledge distillation                 | KL loss exploded due to the uniform-attention disruption at initialization   |
| Zero-init queries (paper default)      | Best result: -0.022 loss improvement                                         |
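The sigmoid-gate failure in the table is easy to reproduce: at init=-10 both the gate value and its gradient are about 4.5e-5, so gradient descent barely moves it:

```python
import torch

gate_logit = torch.tensor(-10.0, requires_grad=True)
gate = torch.sigmoid(gate_logit)
gate.backward()                     # d(gate)/d(logit) = gate * (1 - gate)
print(f"gate={gate.item():.1e}  grad={gate_logit.grad.item():.1e}")  # both ~4.5e-05
```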

The conclusion: for fine-tuning, just use the paper's zero-init. For the best results, train from scratch.

Lessons Learned

  1. Train from scratch for maximum benefit. AttnRes shows 3x larger improvements when the model co-evolves with cross-layer routing from step 0.
  2. Block mode > Full mode at small-to-medium scale. Fewer, more distinctive block representations are easier to route than many nearly-identical sublayer outputs.
  3. Zero-init queries work best. The paper's default initialization (all projection weights = 0 → uniform softmax) outperforms all our alternatives: recency bias, sigmoid gates, LoRA co-adaptation, and knowledge distillation.
  4. Don't give the model an escape hatch. When a learnable parameter allows recovering standard residual behavior, the model will use it. Freeze the bias at zero to force cross-layer learning.
  5. AttnRes adds negligible overhead. 0.03% parameters, <2% latency. The cost is almost free.
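Zero-init works because a zero query scores every source identically: the softmax starts exactly uniform, so at step 0 the block returns a plain average of the sources and nothing is disrupted. A minimal check:

```python
import torch
import torch.nn.functional as F

D, n_sources = 512, 5
query = torch.zeros(D)                           # zero-initialized query
K = torch.randn(n_sources, 2, 16, D)             # dummy normalized keys
logits = torch.einsum("d,nbtd->nbt", query, K)   # all zeros regardless of K
weights = F.softmax(logits, dim=0)
print(round(weights[0, 0, 0].item(), 4))         # 0.2 -> uniform 1/5 per source
```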

Quick Start

# Install
pip install -r requirements.txt

# Train from scratch (Block AttnRes, recommended)
torchrun --nproc_per_node=8 train.py --mode block --num_blocks 4

# Evaluate
python eval.py --model_path output/scratch-block-d512-L12-20k/final --mode block

# Interactive visualization
python app.py --model_path output/scratch-block-d512-L12-20k/final --mode block

Pretrained Weights

| Model              | Mode         | HuggingFace                  |
| ------------------ | ------------ | ---------------------------- |
| 100M Baseline      |              | wdlctc/open-attnres-baseline |
| 100M Block AttnRes | 4 blocks     | wdlctc/open-attnres-block    |
| 100M Full AttnRes  | per-sublayer | wdlctc/open-attnres-full     |

Citation

@software{luo2025openattnres,
  title={Open Attention Residuals},
  author={Cheng Luo and Zefan Cai},
  url={https://github.com/wdlctc/open-attention-residuals},
  year={2025}
}

@article{kimi2025attention,
  title={Attention Residuals},
  author={Kimi Team},
  journal={arXiv preprint arXiv:2603.15031},
  year={2025}
}