Part III · Inference Internals & Production Serving
Chapter 21 · Core · ~21 min read

Prefill vs decode: the two-phase nature of LLM inference

"Half of LLM serving design is built on the fact that prefill and decode are completely different workloads. The other half is built on optimizing one or the other separately."

This is the opening chapter of Part III, and it’s the single most important conceptual division you have to internalize for inference. Once you see prefill and decode as different workloads with different bottlenecks, every later chapter becomes natural: continuous batching, PagedAttention, FlashAttention, KV cache, prefix caching, autoscaling, disaggregation — they all exist to optimize one phase or the other, often by exploiting their differences.

We previewed this in Chapter 8 (the decoding loop). This chapter goes deep on the why: why are these two phases so different, what’s the math behind the difference, and what are the operational consequences? By the end you’ll be able to:

  • Explain the prefill/decode asymmetry to a non-ML engineer in two minutes.
  • Compute the arithmetic intensity of each phase and plot both on a GPU roofline.
  • Predict whether a workload is prefill-bound or decode-bound from its prompt/completion length distribution.
  • Justify every later optimization in this Part as targeting one of the two phases.

Outline:

  1. The story so far.
  2. Prefill mechanically — what happens during the first forward pass.
  3. Decode mechanically — what happens during each subsequent token.
  4. Arithmetic intensity from first principles.
  5. The roofline model.
  6. Where prefill and decode sit on the roofline.
  7. The continuous-batching consequence.
  8. The hardware-choice consequence.
  9. The disaggregation consequence.
  10. Time-to-first-token vs time-per-output-token.

21.1 The story so far

In Chapter 8 we walked through the autoregressive generation loop:

tokens = prompt_ids                                    # 1-D tensor of token ids
for step in range(max_new_tokens):
    logits = model(tokens)                             # forward pass over ALL tokens so far
    next_token = sample(logits[-1])                    # sample from the last position
    tokens = torch.cat([tokens, next_token.view(1)])   # append the new token id
    if next_token.item() == EOS:
        break

We noted then that “the first forward pass and every subsequent one are completely different workloads.” This chapter unpacks why.

The naive code above hides the asymmetry by doing the same thing every step. In production serving stacks (vLLM, SGLang, TensorRT-LLM), the loop is restructured so that the first forward pass processes the entire prompt at once (prefill), and each subsequent forward pass processes only one new token (decode). The KV cache (Chapter 22) makes this restructuring possible by saving the K and V vectors from prefill so decode doesn’t have to recompute them.
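The restructuring is easy to see as shape bookkeeping. Here is a toy accounting sketch (the function `forward_shapes` is invented for illustration, not part of any serving stack): it records how many positions each forward pass touches once the loop is split into prefill and decode.

```python
def forward_shapes(prompt_len, max_new_tokens):
    """Positions processed by each forward pass in the prefill/decode
    restructuring, plus the final KV-cache length."""
    passes = [prompt_len]          # prefill: the whole prompt in one pass
    cache_len = prompt_len         # K/V for every prompt position saved
    for _ in range(max_new_tokens):
        passes.append(1)           # decode: exactly one new position per pass
        cache_len += 1             # its K and V are appended to the cache
    return passes, cache_len
```

For a 1000-token prompt and 3 output tokens this gives `([1000, 1, 1, 1], 1003)`: one big parallel pass, then a string of single-position passes against an ever-growing cache.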

The result is two phases with different shapes:

  • Prefill: one transformer forward pass over S_prompt tokens. Big tensors, lots of arithmetic, parallelizable across positions.
  • Decode: one transformer forward pass per output token. Each pass operates on a single new position. Tiny tensors, very little arithmetic, but the model has to read all its weights from HBM regardless.

The compute and memory profiles of these two phases are wildly different, and that single fact drives 80% of inference design.

Figure: prefill runs all S prompt tokens through the full transformer at once (KV cache saved, first token out); decode runs one new token per step through the transformer plus a KV-cache read, appending its K/V to the cache.
Prefill and decode have fundamentally different input shapes — one big parallel pass vs. one sequential matrix-vector multiply — which is why their GPU bottlenecks are opposite.

21.2 Prefill mechanically

Walk through what prefill actually does. The model is a stack of L transformer blocks, each with attention and FFN. The input is a prompt of length S_prompt.

For each block:

Attention. Compute Q, K, V projections of shape (S_prompt, d_model). Reshape to (S_prompt, H, d_h). Compute the attention scores Q K^T / √d_h of shape (H, S_prompt, S_prompt). Apply causal mask. Softmax. Multiply by V. Project back to (S_prompt, d_model). Save K and V to the cache so decode can use them.

FFN. Apply the SwiGLU FFN: gate, up, silu, multiply, down. Output shape (S_prompt, d_model).

The sizes:

  • Linear layers (Q, K, V, O, gate, up, down): each does a matmul (S_prompt, d_model) @ (d_model, d_*). The matmuls have shapes like (S_prompt, d_model) → (S_prompt, d_model) for QKVO, and (S_prompt, d_model) → (S_prompt, d_ffn) for the FFN.
  • Attention scores: Q @ K^T is (S_prompt, d_h) @ (d_h, S_prompt) → (S_prompt, S_prompt) per head. The whole tensor is (H, S_prompt, S_prompt) per layer.

For a typical prompt (S_prompt = 1000) with d_model = 4096, H = 32, d_ffn = 14336, every matmul has at least one dimension in the thousands. Big tensors keep the tensor cores saturated, and the FLOPs/byte ratio is high enough that the GPU is compute-bound: the bottleneck is the actual arithmetic, not the memory access.

The total compute cost of prefill is approximately 2 × P × S_prompt FLOPs — the 6PD formula from Chapter 11 applies to training (forward + backward + optimizer); for inference (forward only) the factor is 2:

Prefill compute ≈ 2 × P × S_prompt FLOPs   (forward pass only)

For a 70B model with a 1000-token prompt:

2 × 70 × 10⁹ × 1000 = 1.4 × 10¹⁴ FLOPs = 140 TFLOPs of arithmetic

On an H100 doing 989 TFLOP/s in bf16, this takes:

140 / 989 ≈ 0.14 seconds

That’s the theoretical minimum prefill time. In practice you achieve maybe 30–50% of peak, so it’s more like 0.3–0.5 seconds for a 1000-token prefill on an H100. This is the dominant component of time-to-first-token: in a production chat application, TTFT is mostly prefill time.
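The back-of-envelope above is easy to mechanize. A sketch, using the section's numbers (989 TFLOP/s bf16 peak; `mfu`, model FLOPs utilization, stands in for the 30–50% of peak achieved in practice):

```python
def prefill_seconds(n_params, prompt_tokens, peak_tflops=989, mfu=0.4):
    """Estimated prefill wall time: 2 * P * S FLOPs at a given utilization."""
    flops = 2 * n_params * prompt_tokens        # forward-only cost
    return flops / (peak_tflops * 1e12 * mfu)   # achieved FLOP/s
```

`prefill_seconds(70e9, 1000, mfu=1.0)` gives the ~0.14 s theoretical floor; at the default `mfu=0.4` it lands near 0.35 s, inside the 0.3–0.5 s range quoted above.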

21.3 Decode mechanically

Now for the decode side. After prefill, the KV cache has all the K and V vectors for positions 0..S_prompt-1. To generate token S_prompt (the first output token), you run a forward pass with only one new input token — the very last sampled one.

For each block:

Attention. Compute Q, K, V for the single new position. Q has shape (1, d_model). The new K and V are appended to the cache (cache shape: (S_so_far, d_model)). The attention computation is now:

  • Q is (1, d_h) per head.
  • K_full from the cache is (S_so_far, d_h) per head.
  • Scores: Q @ K_full^T is (1, d_h) @ (d_h, S_so_far) → (1, S_so_far) per head.
  • Softmax along the S_so_far axis.
  • attention @ V_full: (1, S_so_far) @ (S_so_far, d_h) → (1, d_h) per head.
  • Project back, store new K/V in the cache.

FFN. Apply the FFN to the single new position: (1, d_model) → (1, d_ffn) → (1, d_model). Tiny matmul.

The matmul shapes are now (1, d_in) → (1, d_out). This is a matrix-vector multiply, not a matrix-matrix multiply. The arithmetic per layer is roughly 2 × (4 × d_model² for QKVO + 3 × d_model × d_ffn for gate/up/down) FLOPs; summed over all layers this is just 2 × P, the per-token forward cost. For a 70B model that’s ~1.4 × 10¹¹ FLOPs per decode step — about 140 GFLOPs for one new token.

Now compute how long this should take on an H100 doing 989 TFLOP/s:

140 GFLOP / 989 TFLOP/s ≈ 140 microseconds

So a decode step should take ~140 μs of pure compute. But it doesn’t. It takes ~50 ms in practice. That’s a ~350× discrepancy.

The reason is that the GPU is no longer compute-bound. The model still has to read all its weights from HBM to do those tiny matmuls. Reading the weights, not computing on them, dominates the time. A 70B model in bf16 has 140 GB of weights, and an H100 has ~3 TB/s of HBM bandwidth, so the minimum time to stream all the weights through the chip is:

140 GB / 3 TB/s ≈ 47 ms

That’s the floor on per-token latency for a 70B model on H100. The compute is irrelevant; the bottleneck is the rate at which you can move the weights from HBM to the compute units. Decode is memory-bandwidth-bound.
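The bandwidth floor generalizes to any model size and GPU. A sketch under the same assumptions as the text (bf16 weights, one full weight read per token, KV-cache traffic ignored):

```python
def decode_floor_ms(n_params, hbm_tb_per_s, bytes_per_param=2):
    """Minimum per-token decode latency: time to stream all weights from HBM."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (hbm_tb_per_s * 1e12) * 1e3   # seconds -> ms
```

`decode_floor_ms(70e9, 3.0)` reproduces the ≈47 ms floor on H100; at H200’s 4.8 TB/s the same model’s floor drops to ≈29 ms without touching the compute units.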

This is the central asymmetry. Prefill is compute-bound; decode is memory-bound. Every other observation about LLM inference flows from this.

21.4 Arithmetic intensity from first principles

To make the asymmetry rigorous, we use a concept from HPC called arithmetic intensity (AI):

arithmetic_intensity = FLOPs / bytes_moved

It’s the number of floating-point operations performed per byte of data transferred from memory. High AI means “spend lots of time computing per byte read” (compute-bound regime). Low AI means “spend lots of time waiting for bytes” (memory-bound regime).

For a generic matmul (M, K) @ (K, N) → (M, N):

  • FLOPs: 2 × M × K × N
  • Bytes (assuming bf16, 2 bytes per element): 2 × (M × K + K × N + M × N)

Arithmetic intensity:

AI = (2 × M × K × N) / (2 × (M × K + K × N + M × N))
   = (M × K × N) / (M × K + K × N + M × N)

Now plug in two cases.

Prefill matmul (large M = S_prompt, K = N = d_model):

  • M = 1000, K = N = 4096
  • AI = (1000 × 4096 × 4096) / (1000 × 4096 + 4096 × 4096 + 1000 × 4096)
  • = 1.68 × 10¹⁰ / 2.50 × 10⁷
  • ≈ 672 FLOPs/byte

Decode matmul (M = 1, K = N = d_model):

  • M = 1, K = N = 4096
  • AI = (1 × 4096 × 4096) / (1 × 4096 + 4096 × 4096 + 1 × 4096)
  • 1.68 × 10⁷ / 1.68 × 10⁷
  • 1 FLOP/byte

The arithmetic intensity differs by roughly 700×. Prefill does ~672 FLOPs of work per byte moved; decode does ~1. That’s the entire story in one number.
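Both worked examples fall out of one helper (same bf16 assumption as above):

```python
def matmul_ai(m, k, n, bytes_per_el=2):
    """Arithmetic intensity (FLOPs/byte) of an (m, k) @ (k, n) matmul,
    counting reads of both operands plus the write of the output."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved
```

`matmul_ai(1000, 4096, 4096)` comes out near 672 and `matmul_ai(1, 4096, 4096)` near 1; the dtype factor cancels between numerator and denominator, so the ratio is the same in fp8 or fp32.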

For comparison, the H100’s “compute-vs-memory crossover” arithmetic intensity is roughly:

peak_compute / peak_bandwidth = 989 TFLOP/s / 3 TB/s ≈ 330 FLOPs/byte

Above 330 FLOPs/byte, the GPU is compute-bound (memory can deliver bytes faster than the ALUs can consume them, so the arithmetic is the limit). Below 330, it’s memory-bound (you’re waiting for bytes). Prefill at AI ≈ 672 is well above the line (compute-bound). Decode at AI ≈ 1 is hundreds of times below it (memory-bound).

This is the formal version of “prefill and decode are different workloads.”

Figure: arithmetic intensity of prefill (~672 FLOPs/byte) vs decode (~1 FLOP/byte), a ~700× gap, with the H100 bf16 crossover (AI ≈ 330) between them.
A single decode step runs at AI ≈ 1 (deeply memory-bound) while a prefill pass runs at AI ≈ 672 (firmly compute-bound). Different bottlenecks demand different optimizations.

21.5 The roofline model

The picture you’ll see in any GPU performance discussion is the roofline: a log-log plot with arithmetic intensity on the x-axis and achievable throughput on the y-axis. The plot has two regions:

  • Memory-bound region (left): the achievable throughput is AI × peak_bandwidth. Doubling AI doubles your throughput; doubling bandwidth doubles your throughput.
  • Compute-bound region (right): the achievable throughput is fixed at peak_compute. AI doesn’t matter once you’re past the crossover.

The two regions meet at the crossover point (AI = peak_compute / peak_bandwidth). On an H100 in bf16, that’s around 330 FLOPs/byte.
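The whole roofline is a single `min` (numbers below are the H100 bf16 figures used throughout this chapter):

```python
def attainable_tflops(ai, peak_tflops=989, bw_tb_per_s=3.0):
    """Roofline: achievable throughput (TFLOP/s) at arithmetic intensity `ai`.
    Memory-bound branch is ai * bandwidth; compute-bound branch is flat."""
    return min(peak_tflops, ai * bw_tb_per_s)
```

`attainable_tflops(1)` is 3 TFLOP/s, leaving more than 99% of the chip’s compute idle, while `attainable_tflops(672)` sits on the 989 TFLOP/s ceiling.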

Throughput
   ^
   |          ____________  peak_compute (989 TFLOP/s)
   |        /
   |       /
   |      /
   |     /
   |    /
   |   /
   |  /
   |_/______________________________> Arithmetic Intensity
   AI=1     AI=330             AI=672
   (decode) (crossover)        (prefill)

Read that diagram. Decode sits at AI ≈ 1, deep in the memory-bound region: doubling the FLOPs of the chip would not make decode faster; doubling the bandwidth would. Prefill sits at AI ≈ 672, in the compute-bound region: doubling the FLOPs would make prefill faster; doubling the bandwidth would do nothing.

Figure: GPU roofline with the memory-bound region on the left and the compute-bound region on the right; peak compute 989 TFLOP/s, crossover at AI = 330, decode at AI ≈ 1, prefill at AI ≈ 672.
The roofline model exposes the prefill/decode divide: decode is hundreds of times below the compute-bandwidth crossover, so only bandwidth improvements help it.

This single picture is the most important diagram in LLM serving. It tells you:

  • What the bottleneck is for each phase.
  • Which optimizations can possibly help (memory-side for decode, compute-side for prefill).
  • What hardware to want for each (high HBM bandwidth for decode-heavy, high FLOPs for prefill-heavy).

21.6 The continuous-batching consequence

Here’s the operational consequence that the next chapter (KV cache) and the chapter after that (continuous batching) both build on.

For decode, the matmul shapes are (1, d_in) → (1, d_out). The “1” is the batch dimension — one token from one user. The matmul is so small that the GPU is essentially idle except for the weight read. You’re paying the cost of reading 140 GB of weights to do ~140 GFLOPs of arithmetic on a tensor with one row.

What happens if you stack many users’ decode steps together? The matmul becomes (N, d_in) → (N, d_out) where N is the number of concurrent users in the batch. The arithmetic scales linearly with N, but the weight read is the same — you only have to read the weights once per layer, regardless of batch size. So the arithmetic intensity scales with N:

AI(N) ≈ N FLOPs/byte

If you batch 64 concurrent users, decode AI is now ~64. Still well below the compute-bound crossover (330) but vastly better than the AI=1 of a single user. Throughput improves nearly linearly with batch size, all the way until you saturate either the compute or the HBM capacity.
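That scaling can be sketched numerically, under the same idealized assumptions as the rest of the chapter (bf16 weights read once per step; KV-cache and activation traffic ignored):

```python
def decode_throughput(batch, n_params=70e9, bw_tb_per_s=3.0, peak_tflops=989):
    """Aggregate decode tokens/sec when one weight read is amortized
    across `batch` concurrent sequences."""
    mem_s = 2 * n_params / (bw_tb_per_s * 1e12)              # weight streaming time
    compute_s = batch * 2 * n_params / (peak_tflops * 1e12)  # matmul time
    return batch / max(mem_s, compute_s)                     # step takes the slower of the two
```

`decode_throughput(1)` is ~21 tokens/s; `decode_throughput(64)` is 64× that, because the same ~47 ms weight read now serves 64 sequences. In this idealized model the linear scaling stops near batch ≈ 330 — exactly the crossover arithmetic intensity.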

This is why continuous batching exists (Chapter 23). It’s the technique that lets you stuff together decode steps from many concurrent users so the GPU does useful work instead of waiting for HBM. Without continuous batching, decode would be hopelessly slow. With it, decode throughput on a 70B model can reach hundreds of tokens per second total, even though each individual user gets ~20 tokens/sec.

The corresponding question for prefill is less interesting: prefill is already at AI=678, so batching doesn’t help much. You could batch multiple users’ prompts together, but each user’s prompt is already big enough to saturate the GPU. Prefill is mostly batched at “1 user with a long prompt” rather than “many users with short prompts.”

21.7 The hardware-choice consequence

If decode is memory-bandwidth-bound, then the primary metric you should optimize for in decode-heavy serving is HBM bandwidth, not FLOPs.

This drives several hardware choices:

  • H100 (3 TB/s HBM3) vs H200 (4.8 TB/s HBM3e): the H200 has the same compute as H100 but ~60% more HBM bandwidth. For decode-bound workloads, H200 is ~60% faster. For prefill-bound workloads, H200 is no faster than H100.
  • B200 (8 TB/s HBM3e) vs H100: B200 has roughly 2.5× the HBM bandwidth of H100. For decode, B200 is ~2.5× faster. For prefill, B200’s compute is ~2.5× faster too, so the speedup is similar but it’s in the compute-bound regime.
  • Consumer GPUs (4090: ~1 TB/s) vs datacenter GPUs (H100: ~3 TB/s): consumer GPUs have far less memory bandwidth, so they’re disproportionately worse at decode; even where their compute would suffice, bandwidth caps the tokens per second.

The senior insight: a 70B-model serving stack is bandwidth-bound, not compute-bound, in the dominant case (decode). This means hardware choice should be driven by HBM bandwidth, not by peak TFLOP/s. The label on the box says “989 TFLOP/s on H100” but the number you actually care about for serving is “3 TB/s of HBM bandwidth.”

The same logic explains why the move to faster HBM3e (2024) arguably did more for inference speed than any compute-side optimization in the same period. The compute was already plenty fast for decode; the bandwidth was the bottleneck.

21.8 The disaggregation consequence

If prefill and decode have such different bottlenecks, why are we running them on the same GPU? This is the question that motivates disaggregated serving (Chapter 36).

The pitch: separate GPUs for prefill and decode. The prefill GPUs are sized for compute (lots of FLOPs, less HBM bandwidth needed); the decode GPUs are sized for HBM bandwidth (lots of bandwidth, less FLOPs needed). The KV cache is generated by the prefill GPUs and shipped to the decode GPUs over a fast interconnect (NVLink or RDMA).

The benefits:

  • No prefill blocking decode. In a co-located setup, a long prefill (say, a 32k-token context) blocks the decode pipeline for hundreds of milliseconds. Disaggregation removes this interference.
  • Each phase runs on its optimal hardware. Prefill on compute-rich GPUs, decode on bandwidth-rich GPUs. Each is closer to its theoretical maximum.
  • Independent autoscaling. You can scale prefill and decode capacity separately based on the workload mix.

The costs:

  • KV cache transfer. You have to ship the K and V tensors from prefill GPUs to decode GPUs every time a request transitions phases. This is hundreds of MB per request and demands fast interconnect (NVLink for intra-node, RDMA over InfiniBand for inter-node).
  • Operational complexity. You’re running two pools of GPUs instead of one, with a routing layer in between.

Disaggregated serving is the production reality for the most demanding workloads (e.g., vision-language models with very long prefill) and is a research-frontier topic for general LLM serving. We’ll cover it in detail in Chapter 36.

The relevant point for this chapter: the entire concept of disaggregation only makes sense once you internalize the prefill/decode asymmetry. Without that, “split the work across two pools” sounds like over-engineering. With it, it sounds like the obvious next step.

21.9 Time-to-first-token vs time-per-output-token

The user-facing metric splits along the prefill/decode line:

  • Time-to-first-token (TTFT): the wall-clock time from the user submitting a request until the first output token streams back. Dominated by prefill.
  • Time-per-output-token (TPOT): the average time between successive output tokens during streaming. Dominated by decode.
Figure: timeline of a single LLM request; TTFT spans the prefill phase up to the first token, and TPOT is the gap between each subsequent token during decode.
TTFT is dominated by prefill and drives perceived snappiness; TPOT is dominated by decode and drives total response time for long completions.

For a chat application with a 2000-token system prompt + RAG context + user message:

TTFT ≈ prefill_time(2000) ≈ 0.3 seconds
TPOT ≈ 50 ms / token

For a 200-token completion, the total response time is:

TTFT + 200 × TPOT ≈ 0.3 + 200 × 0.05 = 10.3 seconds
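The arithmetic generalizes to any prompt/completion mix. A sketch using this section’s illustrative rates (~0.15 s of prefill per 1000 prompt tokens, 50 ms TPOT):

```python
def response_time_s(prompt_tokens, output_tokens,
                    prefill_s_per_1k=0.15, tpot_s=0.05):
    """Total response time = TTFT (prefill) + output_tokens * TPOT (decode)."""
    ttft = prompt_tokens / 1000 * prefill_s_per_1k
    return ttft + output_tokens * tpot_s
```

`response_time_s(2000, 200)` reproduces the 10.3 s above; note that doubling the completion length adds 10 s while doubling the prompt adds only 0.3 s.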

The 0.3 seconds of prefill is small compared to the 10 seconds of decode. But the user perceives TTFT very strongly — they’re staring at a blank chat box during that 0.3 seconds — and the total response time is dominated by decode.

This split has consequences for what to optimize:

  • For perceived snappiness, optimize prefill. Reduce the prompt length, use prefix caching to skip prefill on cached prompts, scale prefill capacity.
  • For total throughput, optimize decode. Use continuous batching, quantize weights to reduce HBM bandwidth requirements, use better hardware for HBM.

The serving stack you build is shaped by which of these matters more for your workload. RAG-heavy systems (long prompts, short responses) are prefill-bound and benefit most from prefix caching. Reasoning-heavy systems (short prompts, very long responses) are decode-bound and benefit most from continuous batching and faster HBM.

21.10 The mental model

Eight points to take into Chapter 22:

  1. Prefill and decode are completely different workloads. This is the spine of inference design.
  2. Prefill is compute-bound (AI ≈ 672) and parallelizes across the prompt.
  3. Decode is memory-bandwidth-bound (AI ≈ 1) and is sequential per request.
  4. The crossover on H100 is ~330 FLOPs/byte. Prefill is above; decode is far below.
  5. Continuous batching makes decode efficient by stacking many users’ decode steps together.
  6. Hardware choice should be driven by HBM bandwidth for decode-heavy workloads, not by FLOPs.
  7. TTFT is prefill; TPOT is decode. Optimize each separately.
  8. Disaggregation is the natural conclusion of the asymmetry — the topic of Chapter 36.

In Chapter 22 we look at the data structure that makes this entire restructuring possible: the KV cache.


Read it yourself

  • Williams et al., Roofline: An Insightful Visual Performance Model for Multicore Architectures (2009). The roofline paper. Read for the framework.
  • Horace He’s blog post Making Deep Learning Go Brrrr From First Principles — the cleanest explanation of arithmetic intensity for ML practitioners.
  • The vLLM blog post on continuous batching, which explains the prefill/decode framing in operational terms.
  • The DistServe paper (Zhong et al., 2024) for the disaggregated framing.
  • NVIDIA’s Hopper architecture white paper for the actual H100/H200 numbers.

Practice

  1. Compute the arithmetic intensity of a (2048, 4096) @ (4096, 4096) → (2048, 4096) matmul. Is it compute-bound or memory-bound on an H100?
  2. Compute the arithmetic intensity of a (1, 4096) @ (4096, 4096) → (1, 4096) matmul. Same question.
  3. For a 70B model in bf16, compute the minimum decode latency per token as a function of HBM bandwidth. Plot it for H100 (3 TB/s), H200 (4.8 TB/s), and B200 (8 TB/s).
  4. A user sends a 2000-token prompt and asks for a 500-token completion. Estimate TTFT and TPOT on a 70B Llama running on a single H100 with vLLM. What’s the total response time? Where’s the bulk of it?
  5. Why does continuous batching help decode but not prefill? Trace the arithmetic intensity in both cases.
  6. Why is HBM bandwidth a better hardware metric than peak TFLOP/s for serving a 70B model? Construct an interview-grade answer in three sentences.
  7. Stretch: Use vLLM’s profiling tools to measure the actual TTFT and TPOT for a real model on a real GPU. Compare to the theoretical estimates from this chapter.

Concept check

  1. Why is decode described as memory-bandwidth-bound while prefill is described as compute-bound?
  2. What metrics correspond to prefill latency and decode latency respectively in a production serving context?
  3. A customer-facing chat application has prompts averaging 200 tokens and completions averaging 500 tokens. Is this workload prefill-bound or decode-bound, and what optimization should be prioritized?
  4. Prefill-decode disaggregation splits the two phases onto separate machines. What is the key trade-off this introduces?