Part I · ML Foundations
Chapter 7 · ~24 min read

The transformer end to end

"A transformer is a residual stream with two operations bolted onto it."

In Chapter 6 we built attention from scratch. In this chapter we wrap it into a full transformer block, stack the blocks into a transformer, and answer the architectural questions you will be asked in interviews:

  • What are the three transformer architectures, and why did decoder-only win?
  • Why pre-norm instead of post-norm?
  • LayerNorm vs RMSNorm — what’s the difference and why does anyone care?
  • What is a position encoding, and why is RoPE everywhere now?
  • What is the “residual stream,” and why is it the right mental model?

By the end of this chapter you will be able to read any modern open-source LLM’s code and immediately recognize what each block is doing.

Outline:

  1. The transformer block, end to end.
  2. Pre-norm vs post-norm — and why pre-norm won.
  3. LayerNorm and RMSNorm.
  4. The FFN: from 4d to SwiGLU and the parameter count.
  5. The residual stream as the central mental model.
  6. Position encodings: absolute, learned, sinusoidal, RoPE.
  7. Stacking blocks: a full transformer.
  8. The three architectures: encoder, decoder, encoder-decoder.
  9. Why decoder-only won.
  10. The full pseudo-code of a Llama-style transformer.

7.1 The transformer block

A transformer block is a sequence-to-sequence operation that takes a tensor of shape (N, S, D) and produces another tensor of shape (N, S, D). The block has two sub-operations:

  1. Multi-head self-attention — the operation we built in Chapter 6.
  2. A feed-forward network (FFN) — a small two- or three-layer MLP applied independently to each position.

Both sub-operations are wrapped in a residual connection and preceded by a normalization layer:

def transformer_block(x):
    x = x + attention(rmsnorm(x))   # attention sub-block
    x = x + ffn(rmsnorm(x))         # FFN sub-block
    return x

That’s the whole block.

[Figure: transformer block data flow. The residual stream x of shape (N, S, D) passes through two sub-blocks, RMSNorm → Attention and RMSNorm → FFN/SwiGLU, each adding its output back into x via a residual connection. Input and output shapes match, so the block stacks L times to form a full transformer.]
Every transformer block is two read-then-write operations on the residual stream: attention moves information between positions, the FFN processes each position independently, and neither changes the stream's shape.
Modern blocks have a few more details (RoPE on the queries and keys inside `attention`, SwiGLU in the FFN), but the structural skeleton is exactly these four lines. Read them again. Every block in every modern LLM is some elaboration of this template.

The two sub-blocks have very different roles:

  • Attention is the “communication” step. It moves information between positions. Each token gets to look at every other token (or every leftward token for a causal model) and aggregate context.
  • The FFN is the “computation” step. It is applied independently to each position with no communication between positions. Its job is to take the contextualized representation produced by attention and process it nonlinearly.

This division of labor is one of the reasons transformers are easy to reason about: the only mechanism by which information moves across the sequence is attention. If you want to ask “how does this token’s representation depend on that token’s input,” the answer is always “through some attention head.” This is also the foundation of the entire field of mechanistic interpretability, which traces information flow through residual streams and attention heads.
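The shape-preserving contract is easy to check with a toy block built from stock PyTorch modules (a simplified stand-in: LayerNorm and a GELU FFN instead of the RMSNorm and SwiGLU used in modern models; all names and sizes here are illustrative):

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # communication across positions
        x = x + self.ffn(self.norm2(x))                    # per-position computation
        return x

x = torch.randn(2, 10, 64)    # (N, S, D)
y = ToyBlock()(x)
assert y.shape == x.shape     # the block never changes the stream's shape
```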

7.2 Pre-norm vs post-norm

The original 2017 transformer applied normalization after each sub-block:

# Post-norm (original)
x = layer_norm(x + attention(x))
x = layer_norm(x + ffn(x))

Modern transformers (GPT-2 onward, Llama, Mistral, everything) apply it before each sub-block:

# Pre-norm (modern)
x = x + attention(layer_norm(x))
x = x + ffn(layer_norm(x))

The difference looks tiny but it matters a lot for training stability. With post-norm, the signal that flows through the residual stream is whatever the normalization layer produces, so the residual is repeatedly re-centered and re-scaled by the normalization. The norm of the activations doesn’t grow predictably, gradients are unstable, and the network is hard to train at depth without a careful warmup schedule.

With pre-norm, the residual stream is never normalized. Each sub-block reads a normalized version of the residual stream, computes an update, and adds it back. The residual stream itself just keeps growing — its norm increases roughly linearly with depth — but the inputs to each sub-block are always well-conditioned. This is much more stable, especially at depth.

The cost of pre-norm is that the residual stream’s norm grows over depth, which means you need a final normalization layer at the very end of the network (after the last block, before the output projection) to bring things back to a reasonable scale. Modern code always has this final norm. You’ll see it in every Llama implementation.

The pre-norm vs post-norm question is a favorite interview prompt, and the right answer is “pre-norm wins because it preserves a clean residual path that gradients can flow through, which is essential for training deep networks without elaborate learning-rate warmup.”
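The norm-growth behavior is easy to see in a toy simulation, where each "sub-block" is just a random linear map applied to an RMS-normalized read of the stream (all sizes here are illustrative):

```python
import torch

torch.manual_seed(0)
d, depth = 256, 48
x = torch.randn(d)
norms = []
for _ in range(depth):
    h = x / x.pow(2).mean().sqrt().clamp_min(1e-5)   # normalized read (unit RMS)
    update = (torch.randn(d, d) @ h) / d ** 0.5      # stand-in for a sub-block's write
    x = x + update                                   # the stream itself is never rescaled
    norms.append(x.norm().item())

# The stream's norm keeps growing with depth, while every sub-block
# still sees a well-conditioned, unit-RMS input.
print(round(norms[0], 1), round(norms[-1], 1))
```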

7.3 LayerNorm and RMSNorm

LayerNorm (Ba, Kiros & Hinton 2016) normalizes a vector x of shape (d,) by computing its mean and variance and re-centering and re-scaling:

mean = (1/d) Σ_i x_i
var  = (1/d) Σ_i (x_i - mean)^2
x_normalized = (x - mean) / sqrt(var + ε)
y = γ ⊙ x_normalized + β

γ and β are learned per-feature scale and shift parameters of shape (d,). The ε is a small constant (typically 1e-5) to prevent division by zero.

LayerNorm operates per token, independently across the batch and the sequence dimension. Two tokens in the same sequence are normalized independently of each other; they have no shared statistics. This is critical: LayerNorm introduces zero cross-token information leakage, which means it doesn’t break the causal property of the attention.

RMSNorm (Zhang & Sennrich 2019) is LayerNorm minus the mean-centering:

rms = sqrt((1/d) Σ_i x_i^2 + ε)
y = γ ⊙ (x / rms)

That’s it. No mean subtraction, no learned bias, just root-mean-square scaling and a learned gain. The result has unit RMS rather than unit variance.

Why drop the mean centering? Two reasons:

  1. Empirical: in practice it makes essentially no difference in quality. Training and eval losses match LayerNorm closely.
  2. Operational: RMSNorm is cheaper. You skip one mean reduction, one subtraction, and the bias parameter. It’s not a huge speedup on its own (LayerNorm is not a bottleneck), but it adds up across hundreds of layers, and the kernel fuses better.

Llama, Mistral, Gemma, Qwen, DeepSeek, and most modern open LLMs use RMSNorm. The original transformer used LayerNorm. The change is one of those “this is just better, no real downside” simplifications that the community adopted en masse around 2022.
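Both norms are a few lines each. A sketch that checks the formulas above against PyTorch's built-in LayerNorm, and confirms that RMSNorm coincides with LayerNorm on a zero-mean input:

```python
import torch
import torch.nn.functional as F

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(-1, keepdim=True)
    var = ((x - mean) ** 2).mean(-1, keepdim=True)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return gamma * (x / rms)

d = 64
x = torch.randn(2, 5, d)
gamma, beta = torch.ones(d), torch.zeros(d)

# The hand-rolled LayerNorm matches PyTorch's reference implementation:
ref = F.layer_norm(x, (d,), gamma, beta)
assert torch.allclose(layer_norm(x, gamma, beta), ref, atol=1e-5)

# On a zero-mean input the two norms coincide, since variance equals mean-square:
x0 = x - x.mean(-1, keepdim=True)
assert torch.allclose(rms_norm(x0, gamma), layer_norm(x0, gamma, beta), atol=1e-5)
```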

What gets normalized

In a modern pre-norm transformer block, the normalization is applied to the residual stream before each sub-block:

x = x + attention(rmsnorm_pre_attn(x))
x = x + ffn(rmsnorm_pre_ffn(x))

There is one RMSNorm per sub-block, with its own learned γ. So a transformer with L layers has 2L RMSNorms in its blocks, plus one final RMSNorm at the very end of the stack (before the output projection). For a 32-layer Llama, that’s 65 RMSNorms total.

Each γ is a vector of shape (d_model,), so the total parameter count of all RMSNorms is 65 × d_model. For d_model = 4096, that’s about 266k parameters — vanishingly small compared to the FFN and attention weights, which total in the billions. RMSNorm is essentially free in parameter count.
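The arithmetic is quick to check:

```python
n_layers, d_model = 32, 4096
n_norms = 2 * n_layers + 1        # two RMSNorms per block, plus the final one
params = n_norms * d_model        # each norm owns a single gain vector of shape (d_model,)
print(n_norms, params)            # 65 norms, 266,240 parameters
```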

7.4 The feed-forward network

Each transformer block has a small feed-forward network applied independently to every position. The original transformer’s FFN was:

def ffn_classic(x):
    return W_2 @ gelu(W_1 @ x + b_1) + b_2

That’s two linear layers with a GELU activation in between. The hidden dimension is conventionally 4× the model dimension (d_ffn = 4 d_model). So for d_model = 4096, the FFN has hidden dim 16384. The first linear is (d_model → 4 d_model), the second is (4 d_model → d_model). The parameter count is 2 × d_model × 4 d_model = 8 d_model², plus small bias terms.

This dwarfs the attention parameters. For a typical block, the attention has 4 d_model² parameters (Q, K, V, and output projections, each d_model × d_model) and the FFN has 8 d_model². The FFN is roughly two-thirds of the parameters in every transformer block. Whatever you do to optimize a transformer at scale, the FFN is where most of the weight lives.
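To make the split concrete, here is a quick count using nn.Linear weights (d is an illustrative model dimension, not a value from the text):

```python
import torch.nn as nn

d = 1024  # illustrative model dimension
# Q, K, V, and output projections, each (d, d):
attn_params = sum(nn.Linear(d, d, bias=False).weight.numel() for _ in range(4))
# Classic FFN: (d -> 4d) and (4d -> d), biases ignored:
ffn_params = (nn.Linear(d, 4 * d, bias=False).weight.numel()
              + nn.Linear(4 * d, d, bias=False).weight.numel())

assert attn_params == 4 * d * d and ffn_params == 8 * d * d
print(ffn_params / (attn_params + ffn_params))   # 2/3: the FFN dominates the block
```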

The SwiGLU upgrade

Modern transformers (Llama, Mistral, etc.) use SwiGLU instead of the classic two-layer GELU FFN. SwiGLU has three linear layers and a SiLU gate:

def ffn_swiglu(x):
    gate = W_gate @ x       # (d_model → d_ffn)
    up   = W_up @ x         # (d_model → d_ffn)
    down = W_down @ (silu(gate) * up)   # (d_ffn → d_model)
    return down

The FFN now has three projections: a “gate” projection through SiLU, an “up” projection that gets element-wise multiplied with the gated activation, and a “down” projection that brings the result back to model dimension. The gating mechanism gives the model a way to learn “which features to amplify and which to suppress” at each position.

The parameter count is now 3 × d_model × d_ffn instead of 2 × d_model × d_ffn. To keep the total parameter count of the FFN comparable to the classic 4× version, modern transformers use d_ffn ≈ (8/3) × d_model ≈ 2.67 × d_model instead of 4 × d_model. The factor 8/3 is chosen so that 3 × d_model × (8/3) d_model = 8 d_model², matching the classic count.

Empirical result: SwiGLU at the same parameter count is consistently better than the classic FFN. It is the FFN you will see in every modern open LLM. The exact reason is debated; the empirical result is not.
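A quick check of the parameter-count match (pure arithmetic; real implementations round d_ffn up to a hardware-friendly multiple):

```python
d_model = 4096
classic = 2 * d_model * (4 * d_model)            # W_1 and W_2 of the classic GELU FFN
d_ffn = int(8 * d_model / 3)                     # ≈ 2.67 × d_model
swiglu = 3 * d_model * d_ffn                     # W_gate, W_up, W_down
print(classic, swiglu)                           # both ≈ 8 × d_model²
assert abs(classic - swiglu) / classic < 0.01    # within 1% of each other
```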

7.5 The residual stream

Now for the most useful mental model in modern transformer interpretation. Recall that every sub-block is wrapped in x = x + sub_block(norm(x)). This means there is a single tensor x of shape (N, S, D) that flows through the entire network from the input embedding to the final output.

[Figure: the residual stream as a shared workspace. embed → block 0 (+Δattn₀, +Δffn₀) → block 1 → … → block L−1 → final RMSNorm → lm_head → logits. The stream's norm grows with depth; pre-norm keeps each sub-block's input well-conditioned, and the final RMSNorm rescales the stream before the output projection.]
Every sub-block writes a small update into the shared stream; the stream is never reset or normalized mid-block, which is why pre-norm works and why the final RMSNorm before the LM head is mandatory.
Each layer reads the stream, computes an update, and adds the update back. The tensor itself is never replaced or normalized — it just accumulates contributions from every layer.

This tensor is called the residual stream. The mental model is:

  • The residual stream is a “shared workspace” of width D at every position.
  • Each attention sub-block reads from the workspace, computes an update based on cross-position information, and writes the update back.
  • Each FFN sub-block reads from the workspace, computes a per-position nonlinear update, and writes the update back.
  • The output projection at the end reads the final state of the workspace and projects it to whatever the task requires (vocabulary logits for an LLM, class logits for a classifier).

The residual stream framing is the foundation of mechanistic interpretability research. When researchers ask “where does this model store the fact that the answer is Paris?”, they’re literally asking “in which dimension of the residual stream, at which position, after which layer.” The picture of “stream in at the input embedding, write into the stream at each layer, read out at the end” is much more useful than thinking of layers as a stack of opaque transformations.

It also explains why residual connections matter so much: they’re not a regularization trick, they’re the actual data structure the model is operating on. Removing them doesn’t just make training harder — it removes the only object the model is actually working with.

7.6 Position encodings

Attention is permutation-equivariant: if you shuffle the tokens in the input, you get the corresponding shuffle of the outputs. This is a problem, because in language, order matters: "dog bites man" and "man bites dog" are not the same.
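A minimal numeric check of this equivariance, using PyTorch's F.scaled_dot_product_attention with no mask (toy shapes, illustrative only):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, H, S, D_h = 1, 2, 6, 8
q, k, v = (torch.randn(N, H, S, D_h) for _ in range(3))

perm = torch.randperm(S)
out = F.scaled_dot_product_attention(q, k, v)     # no causal mask: fully bidirectional
out_perm = F.scaled_dot_product_attention(
    q[:, :, perm], k[:, :, perm], v[:, :, perm]
)

# Shuffling the inputs just shuffles the outputs: attention itself is blind to order.
assert torch.allclose(out[:, :, perm], out_perm, atol=1e-5)
```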

The fix is to inject position information into the input. There are four families of position encoding you’ll meet.

Absolute sinusoidal (the original transformer)

The 2017 paper added a fixed (non-learned) sinusoidal pattern to the input embeddings:

PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))

The reason for the sin and cos of varying frequencies is that the encoding can represent any position uniquely, and (because of trig identities) the model can learn to compute the relative position between two tokens by linear operations on their absolute encodings. It’s a clever construction and it worked for the original transformer.
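Building the table is a few lines (a sketch; `sinusoidal_pe` is an illustrative helper name):

```python
import torch

def sinusoidal_pe(max_pos, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)
    pos = torch.arange(max_pos).unsqueeze(1).float()          # (max_pos, 1)
    i = torch.arange(0, d_model, 2).float()                   # even dimension indices
    freqs = pos / (10000 ** (i / d_model))                    # (max_pos, d_model/2)
    pe = torch.zeros(max_pos, d_model)
    pe[:, 0::2] = torch.sin(freqs)
    pe[:, 1::2] = torch.cos(freqs)
    return pe

pe = sinusoidal_pe(128, 64)
assert pe.shape == (128, 64)
# Each (sin, cos) pair has unit norm, so every row has norm sqrt(d_model / 2):
assert torch.allclose(pe.norm(dim=-1), torch.full((128,), (64 / 2) ** 0.5), atol=1e-4)
```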

It is essentially never used in modern LLMs.

Learned absolute (BERT, GPT-2)

Replace the sinusoidal pattern with a learned embedding table of shape (max_position, d_model). The model learns whatever positional pattern works best. Simple, effective, and the standard in BERT and GPT-2.

Downside: the embedding is fixed at training time to a maximum sequence length. To extend the context window after training, you have to either retrain or use clever tricks. Not future-proof.

Relative position biases (T5)

Instead of adding a position vector to the input, add a bias to the attention scores based on the relative distance between positions:

scores[i, j] += relative_bias(j - i)

The bias is a learned function of the distance. Relative-position biases sidestep the “max sequence length” problem somewhat, and they were used in T5 and a couple of follow-ups. They’ve mostly been displaced by RoPE.

RoPE — Rotary Position Embedding

RoPE (Su et al., 2021) is the position encoding used by Llama, Mistral, GPT-NeoX, Qwen, DeepSeek, and most modern open LLMs. It’s the one you need to understand.

The intuition: instead of adding a position vector to the input, RoPE rotates the query and key vectors by a position-dependent angle inside the attention block. Specifically, in each attention head’s query and key, you split the d_h dimensions into pairs, treat each pair as a 2D vector, and rotate that 2D vector by an angle proportional to the position:

q_rotated[i] = R(θ * pos) @ q[i]
k_rotated[i] = R(θ * pos) @ k[i]

where R(α) is a 2D rotation matrix and the angle θ depends on the dimension index (smaller for higher dimensions, larger for lower dimensions, similar to the sinusoidal encoding’s frequency schedule).

The magic is in the dot product. After rotation:

q_rotated[i] · k_rotated[j] = q[i] @ R(θ * (pos_j - pos_i)) @ k[j]

The dot product depends only on the relative position pos_j - pos_i, not on the absolute positions. This means RoPE has both the simplicity of an absolute encoding (you compute it once per position, you don’t need a special attention bias) and the generalization power of a relative encoding (the same (query, key, distance) triple gives the same attention score regardless of where in the sequence it appears).

Two more reasons RoPE won:

  1. It’s free to extend. To use a model with a longer context than it was trained on, you can just increase the maximum position you compute for. Quality degrades, but it doesn’t crash. (Better extension techniques like YaRN, ABF, and PI exist — Chapter 35.)
  2. It only touches the attention block. RoPE rotates Q and K inside attention; the FFN and the residual stream don’t know it exists. This makes it modular.

You will see RoPE in every modern open LLM. The implementation is a small function called inside forward() in the attention module. We will revisit RoPE in Chapter 35 when we cover long-context extension.
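The relative-position property can be checked numerically. This sketch rotates interleaved (even, odd) dimension pairs (implementations often use an equivalent split-half "rotate_half" form; the property is the same):

```python
import torch

def rope(x, pos, theta_base=10000.0):
    # Rotate consecutive (even, odd) dimension pairs of x by pos * theta_i.
    d = x.shape[-1]
    theta = theta_base ** (-torch.arange(0, d, 2).float() / d)  # one angle per pair
    ang = pos * theta
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
q, k = torch.randn(8), torch.randn(8)

# Same relative distance (j - i = 3) at two different absolute offsets
# gives the same attention score:
s1 = rope(q, 10) @ rope(k, 13)
s2 = rope(q, 100) @ rope(k, 103)
assert torch.allclose(s1, s2, atol=1e-4)
```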

7.7 Stacking blocks

A full transformer is just L of these blocks stacked, with an embedding layer at the input and a projection layer at the output:

class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_heads, max_seq_len):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads) for _ in range(n_layers)
        ])
        self.norm_final = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids):
        x = self.embed(input_ids)               # (N, S, D)
        for block in self.blocks:
            x = block(x)
        x = self.norm_final(x)                  # (N, S, D)
        logits = self.lm_head(x)                # (N, S, vocab_size)
        return logits

That’s the entire transformer. Twenty lines. The block is the unit; the rest is just stacking.

A few production conventions:

  • Tied embeddings. Some models reuse the input embedding matrix as the output projection (lm_head.weight = embed.weight). This saves vocab_size × d_model parameters — a few percent of the total. Llama 1 and 2 tied; Llama 3 untied. The choice doesn’t affect quality much.
  • The final norm matters. Without the norm_final, the residual stream’s growing norm (a property of pre-norm) makes the output projection produce poorly scaled logits. The final norm is not optional in a pre-norm transformer.
  • The output projection has no bias. This is the standard convention. Bias on the LM head doesn’t help and adds unnecessary parameters.
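Tying is a one-line aliasing of the two weight matrices. A sketch with toy sizes (the 32k × 4096 figure below is a Llama-2-7B-ish illustration, not a value from the text):

```python
import torch.nn as nn

vocab, d = 1000, 64                       # toy sizes for the demo
embed = nn.Embedding(vocab, d)            # weight shape (vocab, d)
lm_head = nn.Linear(d, vocab, bias=False) # weight shape (vocab, d) as well

lm_head.weight = embed.weight             # tied: one matrix serves both roles
assert lm_head.weight is embed.weight

# Savings at roughly Llama-2 scale: vocab × d_model parameters.
print(32_000 * 4_096)                     # 131,072,000 — a few percent of a 7B model
```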

7.8 The three architectures

The original transformer paper proposed an encoder-decoder model for machine translation. Subsequent work split it into encoder-only and decoder-only variants. The three families:

Encoder-only (BERT, RoBERTa, DeBERTa)

The model takes a fixed-length input, attends bidirectionally (no causal mask), and produces a contextualized representation per position. There is no autoregressive generation. The model is trained with masked language modeling (MLM): mask out a random 15% of the input tokens and ask the model to predict them.

Encoder-only models are the right choice for understanding tasks: classification, named entity recognition, sentence-pair similarity, search retrieval, embeddings. They are not generative.

Embedding models (Chapter 9) are encoder-only.

Encoder-decoder (T5, BART, mT5, NLLB)

Two stacks: an encoder that reads the input bidirectionally, and a decoder that generates the output autoregressively, with cross-attention from the decoder to the encoder’s output. Trained on sequence-to-sequence tasks: translation, summarization, span infilling.

Encoder-decoder is natural for tasks where the input and output are clearly distinct (English text in, French text out). It was the dominant architecture for translation and summarization for a few years. T5 was the most influential example.

Decoder-only (GPT, Llama, Mistral, almost everything modern)

A single stack with a causal mask. The model is trained on next-token prediction over a continuous stream of text — no encoder, no cross-attention. Generation is one forward pass per output token. This is the architecture of every modern LLM you’ve heard of.

Why did decoder-only win?

7.9 Why decoder-only won

Three reasons.

[Figure: the three transformer architectures. Encoder-only (BERT): bidirectional self-attention → embeddings, classification. Encoder-decoder (T5, BART): encoder bridged to the decoder by cross-attention → translation, summarization. Decoder-only (GPT, Llama, modern LLMs): causal self-attention → universal generation, KV cache scales cleanly.]
Decoder-only won because a single next-token-prediction objective on any text covers all tasks, and the causal mask makes the KV cache possible; the encoder-only and encoder-decoder architectures require separate heads or re-encoding for each task.

(1) Universality of the next-token-prediction task. A decoder-only model trained on next-token prediction can do classification, translation, summarization, and dialogue all with the same architecture and the same objective — you just frame the task as a prompt. The flexibility is enormous. Encoder-decoder models, by contrast, were built for one task at a time.

(2) Scaling cleanly. The next-token-prediction objective gives you a single, clean loss that scales with data. Every byte of text in the training set contributes a gradient signal. There’s no architectural decision about how to handle “the input vs the output.” Everything is just text. This made decoder-only the natural fit for the “scale is all you need” era.

(3) The KV cache. A decoder-only model with causal attention has the property that K and V for past tokens never change as new tokens are generated. This makes autoregressive generation efficient: you can cache all the K and V vectors from the prompt and only compute one new K/V per output token. Encoder-decoder models also have a KV cache for the decoder side, but the encoder has to be re-run from scratch each time the input changes. For chat applications where the input changes turn-by-turn, this is awkward. Decoder-only with KV cache + prefix sharing (Chapter 29) is the architectural pattern that makes long conversations cheap.
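This property (past K and V never change) is easy to verify: decode token by token with a growing cache and compare against one full causal pass. A toy single-head sketch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
S, D_h = 6, 8
q, k, v = (torch.randn(S, D_h) for _ in range(3))

# Full causal attention in one shot:
full = F.scaled_dot_product_attention(
    q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0), is_causal=True
).squeeze(0)

# Incremental decoding: each step attends the new query over the cached K/V prefix.
k_cache, v_cache, outs = [], [], []
for t in range(S):
    k_cache.append(k[t]); v_cache.append(v[t])   # past K/V never change: that's the cache
    K, V = torch.stack(k_cache), torch.stack(v_cache)
    attn = torch.softmax(q[t] @ K.T / D_h ** 0.5, dim=-1)
    outs.append(attn @ V)
step = torch.stack(outs)

assert torch.allclose(full, step, atol=1e-5)     # identical outputs, token by token
```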

There are still active research directions in encoder-decoder models, especially for translation (NLLB, NLLB-200). But the dominant architecture for general-purpose LLMs is decoder-only, and that is unlikely to change.

7.10 The full pseudo-code of a Llama-style transformer

Putting everything together. This is essentially the architecture of Llama 2 / Llama 3, simplified to the bones:

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def rmsnorm(x, weight, eps=1e-5):
    rms = torch.sqrt((x * x).mean(-1, keepdim=True) + eps)
    return weight * (x / rms)

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q, k, cos, sin):
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

class Attention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_h = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, cos, sin):
        N, S, D = x.shape
        H, D_h = self.n_heads, self.d_h
        qkv = self.qkv(x).view(N, S, 3, H, D_h)
        q, k, v = qkv.unbind(dim=2)                       # each (N, S, H, D_h)
        q, k = apply_rope(q, k, cos, sin)
        q = q.transpose(1, 2)                              # (N, H, S, D_h)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(D_h)
        causal_mask = torch.triu(torch.full((S, S), float('-inf'), device=x.device), diagonal=1)
        scores = scores + causal_mask
        attn = scores.softmax(dim=-1)
        out = attn @ v                                     # (N, H, S, D_h)
        out = out.transpose(1, 2).contiguous().view(N, S, D)
        return self.out(out)

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ffn):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ffn, bias=False)
        self.up   = nn.Linear(d_model, d_ffn, bias=False)
        self.down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Block(nn.Module):
    def __init__(self, d_model, n_heads, d_ffn):
        super().__init__()
        self.norm1 = nn.Parameter(torch.ones(d_model))
        self.attn  = Attention(d_model, n_heads)
        self.norm2 = nn.Parameter(torch.ones(d_model))
        self.ffn   = SwiGLUFFN(d_model, d_ffn)

    def forward(self, x, cos, sin):
        x = x + self.attn(rmsnorm(x, self.norm1), cos, sin)
        x = x + self.ffn(rmsnorm(x, self.norm2))
        return x

class LlamaLike(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_heads, d_ffn, max_seq_len):
        super().__init__()
        self.embed   = nn.Embedding(vocab_size, d_model)
        self.blocks  = nn.ModuleList([Block(d_model, n_heads, d_ffn) for _ in range(n_layers)])
        self.norm    = nn.Parameter(torch.ones(d_model))
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # RoPE precompute would live here in real code; cos and sin must broadcast
        # against q and k of shape (N, S, H, D_h), e.g. shape (S, 1, D_h)
        self.cos = ...
        self.sin = ...

    def forward(self, input_ids):
        x = self.embed(input_ids)
        cos, sin = self.cos[:input_ids.size(1)], self.sin[:input_ids.size(1)]
        for block in self.blocks:
            x = block(x, cos, sin)
        x = rmsnorm(x, self.norm)
        return self.lm_head(x)

Read this code top to bottom. Every modern open LLM is a small variation of these 60 lines: maybe GQA instead of MHA, maybe MoE instead of dense FFN, maybe a slightly different RoPE schedule. The skeleton is identical. If you can write this from memory, you can read any LLM source and not feel lost.

7.11 The mental model

Eight points to take into Chapter 8:

  1. A transformer block is attention + FFN + residuals + pre-norms. Four lines.
  2. Pre-norm wins because it leaves the residual stream untouched.
  3. RMSNorm replaces LayerNorm in modern models because it’s cheaper and just as good.
  4. The FFN is two-thirds of the parameters. SwiGLU is the modern variant.
  5. The residual stream is the actual object the network is working with — every sub-block is a read-then-write into it.
  6. RoPE is the position encoding of every modern open LLM. It rotates Q and K by a position-dependent angle.
  7. Three architectures exist: encoder, encoder-decoder, decoder-only. Decoder-only won because of universality, scaling, and the KV cache.
  8. The full transformer is 60 lines — I just wrote them. Read them until the structure feels obvious.

In Chapter 8 we use this transformer to actually generate text — the autoregressive decoding loop, sampling, and the generation parameters that control the output.


Read it yourself

  • Vaswani et al., Attention Is All You Need (2017) — the original architecture, with post-norm and sinusoidal position encoding.
  • Touvron et al., Llama 2: Open Foundation and Fine-Tuned Chat Models (2023) — read sections 2 (architecture) carefully. Modern reference.
  • Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding (2021) — the RoPE paper.
  • Anthropic, A Mathematical Framework for Transformer Circuits — the residual-stream interpretability framing.
  • Andrej Karpathy’s nanoGPT GitHub repo — the cleanest minimal transformer in PyTorch. Read it cover to cover.

Practice

  1. Why does pre-norm let the gradient flow through the residual stream more cleanly than post-norm? Trace one backward pass through both arrangements and identify where the normalization sits.
  2. Implement RMSNorm in 5 lines of PyTorch from scratch. Verify against torch.nn.RMSNorm (PyTorch 2.4+) on a random input.
  3. The SwiGLU FFN has three linear layers with hidden dim d_ffn = (8/3) d_model, while the classic FFN has two with d_ffn = 4 d_model. Show that the parameter counts match.
  4. Read nanoGPT/model.py cover to cover. For every parameter and every line of forward, identify which equation in this chapter it corresponds to.
  5. Why does RoPE only need to be applied to the queries and keys, not to the values? (Hint: think about which dot product produces the attention scores.)
  6. Why is decoder-only the right architecture for an LLM-as-chatbot, and what specific property of causal attention makes the KV cache work? Walk through one round of conversation in your head.
  7. Stretch: Take the LlamaLike class above, fill in the RoPE precomputation, train it on a small text corpus (the tinyShakespeare dataset is the classic choice), and verify it generates coherent-ish text after 30 minutes of training on a single GPU.

Concept check

  1. In a pre-norm transformer block, the norm is applied before the attention and FFN sub-layers. What training problem does this solve compared to post-norm?
  2. How does RoPE (Rotary Position Embedding) encode position, and where in the block is it applied?
  3. The SwiGLU FFN used in Llama replaces the standard two-linear-layer FFN. What is its key structural difference?
  4. Why did the decoder-only architecture become dominant over encoder-decoder for large language models?