The decoding loop: autoregressive generation, sampling, controllability
"The model's job is to give you a probability distribution over the next token. Everything else — temperature, top-p, repetition penalty, beam search, JSON mode — is what you do with that distribution."
In Chapter 7 we built a transformer that maps (N, S) token IDs to (N, S, vocab_size) logits. In this chapter we use those logits to actually generate text. This is the autoregressive decoding loop, and it’s where the model becomes a generator instead of a classifier.
The chapter splits cleanly into two parts:
- The mechanical loop — how the model goes from a prompt to a completed response, one token at a time, and how prefill and decode have radically different costs.
- Sampling and controllability — the family of techniques that turn the raw next-token distribution into the actual sampled token: greedy, temperature, top-k, top-p, min-p, repetition penalty, beam search, and the controlled-decoding techniques used in production.
By the end you’ll know exactly what every generation parameter in model.generate() does, why temperature 0 is not actually deterministic in practice, and where the prefill/decode asymmetry that we’ll exploit in later chapters comes from.
Outline:
- The autoregressive loop, end to end.
- Prefill vs decode — the asymmetry you must internalize.
- Logits → probabilities: softmax with a temperature dial.
- Greedy decoding.
- Temperature scaling.
- Top-k sampling.
- Top-p (nucleus) sampling.
- Min-p sampling.
- Repetition penalty and presence/frequency penalties.
- Beam search and why it died for LLMs.
- EOS handling and stop sequences.
- Why temperature 0 isn’t actually deterministic.
- Speculative and constrained decoding — forward pointers.
8.1 The autoregressive loop
The setup: you have a trained decoder-only transformer (Chapter 7) and a tokenized prompt of length S_prompt. You want to generate T tokens of completion.
The loop, in pseudocode:
```python
def generate(model, prompt_ids, T):
    tokens = prompt_ids                          # (S_prompt,)
    for step in range(T):
        logits = model(tokens)                   # (S_current, vocab_size)
        next_token_logits = logits[-1]           # (vocab_size,) — last position
        next_token = sample(next_token_logits)
        tokens = torch.cat([tokens, next_token.unsqueeze(0)])
        if next_token == EOS:
            break
    return tokens
```
That’s the entire loop, ignoring the KV cache (which we’ll add in Chapter 22). At each step, you run the whole sequence through the model, take the logits at the last position, sample a token from that distribution, append it, and repeat.
Two things to notice immediately:
(1) Generation is sequential. You cannot generate token t+1 until you’ve sampled token t, because token t is part of the input that produces token t+1. This is the deep source of the latency-vs-throughput tradeoff in LLM serving: you can batch many independent generations across users, but within a single generation, the steps cannot be parallelized.
(2) The forward pass is doing the same work over and over. At step t, we ask the model to compute logits for every position in the sequence — including all the positions we already computed at step t-1. The K and V vectors for positions 0..t-1 haven’t changed; we’re computing them fresh every step. This is a massive waste, and the entire point of the KV cache (Chapter 22) is to fix it.
For now, ignore the inefficiency. Focus on the structure: prefill the prompt once, then generate one token at a time, conditioned on everything generated so far.
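To make the structure concrete, here is a self-contained, runnable version of the loop. `ToyModel` is a stand-in for the transformer (it just emits deterministic pseudo-random logits), not a real network; greedy sampling is used so the run is reproducible:

```python
import torch

EOS = 0  # hypothetical end-of-sequence token ID for this toy setup

class ToyModel:
    """Stand-in for the transformer: deterministic random logits per position."""
    def __init__(self, vocab_size=50):
        self.vocab_size = vocab_size
    def __call__(self, tokens):
        g = torch.Generator().manual_seed(int(tokens.sum()))
        return torch.randn(len(tokens), self.vocab_size, generator=g)

def generate(model, prompt_ids, T):
    tokens = prompt_ids
    for _ in range(T):
        logits = model(tokens)                    # full forward every step
        next_token = logits[-1].argmax()          # greedy, for determinism
        tokens = torch.cat([tokens, next_token.unsqueeze(0)])
        if next_token.item() == EOS:
            break
    return tokens

out = generate(ToyModel(), torch.tensor([5, 6, 7]), T=10)
print(out)  # prompt tokens followed by up to 10 generated tokens
```

Swapping `argmax` for any of the samplers in the rest of this chapter changes the behavior of the loop without changing its shape.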
8.2 Prefill vs decode — the most important asymmetry in LLM serving
The naive loop hides one of the most consequential facts about LLM inference: the first forward pass and every subsequent forward pass are completely different workloads.
The first forward pass processes the entire prompt at once. If the prompt is 2000 tokens, this is one transformer forward over (N=1, S=2000). The attention block does a (2000, d_h) @ (d_h, 2000) matmul per head, the FFN does a (2000, d_ffn) @ (d_ffn, d_model) matmul, and so on. Every single matmul is large. This phase is compute-bound: the GPU is doing useful arithmetic on big tensors and the FLOPs are high.
This is called prefill. It’s where the prompt’s KV cache gets filled in.
Every subsequent forward pass processes exactly one new token — the most recently generated one. The matmul shapes are now (1, d_h) @ (d_h, S_so_far) per head and (1, d_ffn) @ (d_ffn, d_model) for the FFN. Every matmul is tiny in the sequence dimension. The amount of arithmetic is small, but the model still has to read all of its weight matrices from HBM to do that arithmetic. So you spend nearly all of the time moving weights and almost none of it computing on them. This phase is memory-bandwidth-bound.
This is called decode.
The two phases have wildly different efficiency characteristics:
| Phase | Work per step | Bottleneck | Per-token cost | Batchable? |
|---|---|---|---|---|
| Prefill | one big matmul over the whole prompt | compute (FLOPs) | very low (parallel) | well |
| Decode | one tiny matmul per layer | memory bandwidth | high (sequential) | only across users |
Three concrete consequences:
(1) Time-to-first-token (TTFT) is dominated by prefill. A long prompt produces a long prefill phase, which means a long wait before the user sees the first token. For chat applications with system prompts and RAG context, prefill can be most of the latency.
(2) Time-per-output-token (TPOT) is dominated by decode. Once generation starts, every token takes roughly the same time, and that time is set by how fast you can stream the model weights through HBM. A 70B model in bf16 has 140 GB of weights, and an H100 has ~3 TB/s of HBM bandwidth, so the minimum time per token is 140 / 3000 ≈ 47 ms. You cannot go faster than that without quantizing the weights, sharding across more GPUs, or using speculative decoding.
(3) Continuous batching helps decode but not prefill. If you have many users sending requests, you can stuff their decode steps together (batch dim grows) and amortize the weight bandwidth cost. The matmuls are small enough that adding more batch is essentially free until you fill the GPU. We’ll see this in Chapter 23.
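The bandwidth lower bound from consequence (2) is worth being able to derive on demand. A minimal sketch (the function name is mine; the numbers match the text's 70B/H100 example):

```python
def decode_latency_floor_ms(params_billion: float,
                            bytes_per_param: int = 2,       # bf16
                            hbm_bandwidth_tb_s: float = 3.0) -> float:
    """Lower bound on per-token decode latency: every decode step must
    stream all model weights from HBM at least once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_s = hbm_bandwidth_tb_s * 1e12
    return weight_bytes / bandwidth_bytes_s * 1e3  # milliseconds

print(decode_latency_floor_ms(70))  # ~47 ms for a 70B model in bf16
print(decode_latency_floor_ms(7))   # ~4.7 ms for a 7B model
```

The 7B model is ten times faster per token for the same reason it is ten times smaller: decode latency is weight bytes divided by bandwidth, nothing more.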
This asymmetry is the seed of disaggregated serving (Chapter 36). If prefill is compute-bound and decode is memory-bound, why not run them on different hardware? That’s exactly the idea behind systems like DistServe and the production implementations we mentioned in earlier chapters. For workloads where the prompts are long and the completions are short — like vision-language with image inputs — disaggregation is a clear win.
You will be asked about the prefill/decode asymmetry in interviews. The right answer is: “prefill is compute-bound and parallelizable across the prompt; decode is memory-bandwidth-bound and serial across tokens; the entire serving optimization stack is built around exploiting this difference.”
8.3 Logits → probabilities
The model emits logits: real numbers, one per vocabulary token, of shape (vocab_size,). We need to turn them into a probability distribution and then sample. The standard tool is softmax with a temperature:
p_i = exp(z_i / T) / Σ_j exp(z_j / T)
T is the temperature. Read what it does:
- `T = 1` is plain softmax. Use the model’s distribution as-is.
- `T < 1` sharpens the distribution. The largest logits get even more probability mass; the small ones get even less. In the limit `T → 0`, the distribution becomes a single spike on the largest logit (this is greedy decoding).
- `T > 1` flattens the distribution. The relative differences between logits get compressed; everything becomes more uniform. In the limit `T → ∞`, the distribution becomes uniform (random token selection).
Temperature is the single most important sampling knob. It controls the diversity of the output, with no other side effects. Low temperature gives confident, conservative, repetitive output. High temperature gives diverse, creative, sometimes nonsensical output. The sweet spot for most chat applications is T ≈ 0.7, but the right value depends entirely on the task.
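You can see the sharpening and flattening directly on a toy distribution (the logit values here are illustrative):

```python
import torch

logits = torch.tensor([4.0, 2.0, 1.0, 0.5])  # toy next-token logits

for T in (0.5, 1.0, 2.0):
    probs = (logits / T).softmax(dim=-1)
    print(T, probs)
# At T=0.5 nearly all mass sits on the top token; at T=2.0 the
# distribution is much closer to uniform.
```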
8.4 Greedy decoding
The simplest sampler: always pick the token with the highest logit.
```python
def greedy(logits):
    return logits.argmax(dim=-1)
```
That’s the whole sampler. Equivalent to T = 0 (or top_k = 1, which we’ll see in §8.6). Always picks the model’s highest-probability choice.
Greedy is deterministic (in theory — see §8.12 for why “in theory” matters). It’s also boring: the model often picks the same starting words for similar prompts, gets stuck in repetition loops, and produces output that feels stilted.
Use greedy when:
- You want reproducible output for the same prompt.
- You’re running an eval and need to compare model outputs across runs.
- You’re doing structured generation where exactly one answer is correct.
Don’t use greedy for chat. It feels mechanical.
8.5 Temperature scaling, properly understood
Already introduced in §8.3. To be concrete about what it actually does:
```python
def sample_with_temperature(logits, T):
    logits = logits / T
    probs = logits.softmax(dim=-1)
    return torch.multinomial(probs, num_samples=1)
```
The multinomial function draws a single sample from the categorical distribution defined by the probabilities. It’s the “draw a marble from a weighted bag” operation.
Common temperature values:
| Temperature | Effect |
|---|---|
| 0.0 | Equivalent to greedy. |
| 0.1–0.3 | Very conservative; useful for code, math, factual answers. |
| 0.7 | The “default” for chat. Good balance. |
| 1.0 | Use the model’s raw distribution. |
| 1.2–2.0 | Very creative, often incoherent. |
Temperature alone is not enough for production. The problem is the long tail: even at moderate temperatures, the model assigns nonzero probability to thousands of “wrong” tokens, and occasionally one of them gets sampled, producing a glitch token in the middle of an otherwise coherent response. You want to prune the tail.
That’s what top-k and top-p do.
8.6 Top-k sampling
The simplest tail-pruning technique: only consider the top k most-probable tokens, set everything else to -∞, then sample (with temperature) from the truncated distribution.
```python
def top_k(logits, k):
    top_k_values, _ = logits.topk(k, dim=-1)
    threshold = top_k_values[..., -1, None]
    return logits.masked_fill(logits < threshold, float('-inf'))
```
Common values: k = 40 or k = 50 for chat. The exact value isn’t critical; anywhere from 20 to 100 produces similar behavior.
The problem with top-k is that k is constant across positions, but the right number of plausible tokens varies wildly. After a prompt like "The capital of France is", there is essentially one plausible next token (Paris), and k = 50 is needlessly permissive — the 50th-best token is still in the running. After a vague prompt like "Tell me a story", there are thousands of plausible openings, and k = 50 is too restrictive.
The fix: top-p.
8.7 Top-p (nucleus) sampling
Top-p sampling, introduced by Holtzman et al. (2019) in The Curious Case of Neural Text Degeneration, replaces the fixed k with a fixed cumulative probability threshold:
```python
def top_p(logits, p):
    sorted_logits, sorted_indices = logits.sort(descending=True)
    cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
    sorted_indices_to_remove = cumulative_probs > p
    # shift right so the token that crosses the threshold is kept
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False
    # map the mask back from sorted order to vocabulary order
    indices_to_remove = sorted_indices_to_remove.scatter(
        dim=-1, index=sorted_indices, src=sorted_indices_to_remove
    )
    return logits.masked_fill(indices_to_remove, float('-inf'))
```
The idea: sort the tokens by probability, take the smallest set whose cumulative probability is at least p, throw the rest away, sample from what’s left.
This adapts to the situation. After "The capital of France is", the top token (Paris) might already have probability 0.95 — top-p will prune everything else and leave just Paris. After "Tell me a story", no single token is dominant, and top-p will keep hundreds of tokens in play.
Common values: p = 0.9 or p = 0.95. The closer to 1, the more permissive.
Top-p is the default sampler in most modern LLM APIs and serving stacks. When you call OpenAI’s API with the default settings, you’re using top-p with a temperature.
Top-p and top-k can be combined: apply top-k first to bound the worst case, then top-p to adapt within that bound. This is what HuggingFace’s generate() does by default.
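A full sampling step that chains temperature, top-k, and top-p in that order might look like the sketch below. The function name and default values are mine, and the filters are inlined so the snippet stands alone; the logic matches the `top_k` and `top_p` functions from the previous sections:

```python
import torch

def sample_next_token(logits, temperature=0.7, k=50, p=0.95):
    # Temperature first, so the truncation filters see the scaled distribution.
    logits = logits / temperature
    # Top-k: mask everything below the k-th largest logit (bounds the worst case).
    kth = logits.topk(k).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float('-inf'))
    # Top-p: mask the tail once cumulative probability exceeds p (adapts within it).
    sorted_logits, sorted_idx = logits.sort(descending=True)
    cum = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
    remove = cum > p
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False
    mask = remove.scatter(-1, sorted_idx, remove)
    logits = logits.masked_fill(mask, float('-inf'))
    # Sample from whatever survived.
    probs = logits.softmax(dim=-1)
    return torch.multinomial(probs, num_samples=1)

token = sample_next_token(torch.randn(1000))
```

The ordering matters: applying temperature after truncation would rescale a distribution whose tail has already been cut, which changes the effective p and k.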
8.8 Min-p sampling
The newest member of the family. Min-p (Nguyen et al., 2024) keeps every token whose probability is at least some fraction of the most-probable token’s probability:
```python
def min_p(logits, min_p):
    probs = logits.softmax(dim=-1)
    top_prob = probs.max(dim=-1, keepdim=True).values
    threshold = top_prob * min_p
    return logits.masked_fill(probs < threshold, float('-inf'))
```
So min_p = 0.1 means “keep any token whose probability is at least 10% of the top token’s probability.”
The motivation: top-p has weird behavior when the distribution is very flat. If the top token has probability 0.05 and the next 19 are at 0.045 each, top-p with p = 0.95 keeps all of them — but that’s because they’re all close together, not because the model thinks any of them is great. Min-p ignores the cumulative threshold and uses a relative one, which is more robust to distribution shape.
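The flat-distribution case is easy to demonstrate numerically. A sketch with made-up probabilities (0.95 and 0.5 are my illustrative thresholds): one strong-ish token, nineteen near-ties, and a long junk tail. Top-p’s cumulative threshold drags in dozens of tail tokens to reach 0.95; min-p’s relative threshold keeps exactly the twenty plausible ones:

```python
import torch

# Flat head: top token ~0.05, 19 near-ties at 0.045, then a junk tail.
logits = torch.log(torch.tensor([0.05] + [0.045] * 19 + [0.001] * 100))
probs = logits.softmax(dim=-1)

# top-p (p=0.95): size of the smallest set with cumulative prob >= 0.95
sorted_probs, _ = probs.sort(descending=True)
kept_top_p = int((sorted_probs.cumsum(dim=-1) < 0.95).sum()) + 1

# min-p (0.5): tokens with prob >= 50% of the top token's prob
kept_min_p = int((probs >= 0.5 * probs.max()).sum())

print(kept_top_p, kept_min_p)  # top-p keeps ~70 tokens, min-p keeps 20
```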
Min-p has been gaining popularity in the open-source LLM community. It’s a nice tool to have in the kit.
8.9 Repetition penalty and presence/frequency penalties
A separate problem from “what tokens are plausible” is “the model keeps repeating itself.” Greedy and even moderate-temperature sampling have a strong tendency to fall into repetition loops, especially with smaller models. The fix is to penalize tokens the model has already produced before sampling.
The original repetition penalty (Keskar et al., CTRL paper, 2019):
```
logit[token] /= penalty   if token is in the generated history (and logit > 0)
logit[token] *= penalty   if token is in the generated history (and logit < 0)
```
The asymmetric handling of positive and negative logits is so that the penalty consistently reduces the probability of repeated tokens, regardless of the sign of the logit. Common values: penalty = 1.1 to 1.3. Higher means more aggressive de-repetition, but too high and the model starts inventing words to avoid repeating.
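The two-branch rule above translates directly into a few lines of PyTorch. A minimal sketch (the function name is mine):

```python
import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """CTRL-style multiplicative penalty, applied in place: divide positive
    logits and multiply negative ones, so repeated tokens always lose
    probability mass regardless of sign."""
    prev = torch.unique(generated_ids)
    scores = logits[prev]
    logits[prev] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits

logits = torch.tensor([2.0, -1.0, 0.5, 3.0])
history = torch.tensor([0, 1])                 # tokens 0 and 1 already generated
apply_repetition_penalty(logits, history, penalty=2.0)
# token 0: 2.0 -> 1.0 (divided); token 1: -1.0 -> -2.0 (multiplied)
```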
The OpenAI variant uses two parameters: presence penalty (constant subtraction from any token that has appeared at all) and frequency penalty (subtraction proportional to how often the token has appeared). The OpenAI presence/frequency penalties are additive in logit space, while the CTRL repetition penalty is multiplicative. They are not the same thing, and the values you use for one are not interchangeable with the other.
Modern best practice: a small repetition penalty (1.05–1.1) is almost always worth it; large values are dangerous. The DRY (Don’t Repeat Yourself) sampler is a more sophisticated alternative gaining popularity in the open-source community — it specifically penalizes repeated n-grams rather than individual tokens.
8.10 Beam search, and why it died for LLMs
Beam search keeps the top b most-likely sequences in flight at every step, expanding each by every possible next token, scoring all b × vocab_size candidates by total log-probability, and pruning to the top b. At the end, you return the highest-scoring complete sequence. With b = 1, beam search is greedy. With larger b, it’s more exhaustive.
Beam search is the right algorithm for tasks where there is one correct output and you want the model’s most likely guess: machine translation, image captioning, summarization. For these tasks, beam search consistently produces better outputs than greedy or sampling because it explores multiple high-probability paths.
For open-ended generation — chat, creative writing, code completion — beam search fails. The reason is that “high total log-probability” is not the same as “high quality.” Beam search consistently finds outputs that are bland, repetitive, and short, because short bland outputs naturally have higher total probability than long interesting ones. The Holtzman paper that introduced top-p was largely a critique of beam search for open-ended generation.
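The length bias is easy to see on a toy score comparison. The per-step probabilities below are invented for illustration: a short, safe continuation where the model is confident at every step, against a longer continuation that opens with a rarer token:

```python
import math

safe = [0.6, 0.9]                    # short, bland: confident at each step
interesting = [0.2, 0.7, 0.8, 0.9]   # rarer opening, then coherent

def total_log_prob(step_probs):
    # Beam search scores candidates by the sum of per-step log-probabilities.
    return sum(math.log(p) for p in step_probs)

print(total_log_prob(safe), total_log_prob(interesting))
# The short sequence wins: every extra step multiplies in a number < 1,
# so longer sequences are penalized regardless of quality.
```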
The result: nobody uses beam search for chat-style LLMs anymore. It survives in machine translation and a few other one-correct-answer tasks. If you see it in an LLM API, it’s vestigial.
8.11 EOS handling and stop sequences
The model is trained to emit a special end-of-sequence (<eos>, </s>, <|endoftext|>, <|eot_id|>, etc.) token when its response is complete. The decoding loop checks for this token after each step and breaks out of the loop when it appears.
```python
if next_token == EOS:
    break
```
For instruction-tuned models, the EOS token is typically the <|eot_id|> (end-of-turn) token from the chat template — not the raw <|end_of_text|>. This is important: stopping on the wrong token either ends the response too early or runs the model past the natural stopping point.
In addition to the EOS token, you usually want stop sequences: arbitrary strings that, if generated, terminate generation. The OpenAI API exposes stop=["\nUser:", "\n\n"] as a parameter. Internally, the serving stack tokenizes the stop sequence, runs the generation, and after each new token checks whether the recent generation ends with the stop sequence as a suffix.
Stop sequences are one of the most common sources of bugs in LLM serving. The pitfalls:
- Tokenization mismatch. The stop sequence might tokenize differently depending on what came before it (due to the leading-space gotcha from Chapter 5). Comparing decoded text rather than token IDs is more robust.
- Streaming partial matches. If you’re streaming tokens to a client and the stop sequence is `"User:"`, the model might generate `"User"` then keep going to `"User said"` — and you should not have streamed the `"User"` token to the client until you knew whether it was part of the stop sequence. This requires buffering.
- Multi-token stop strings. Stop sequences are often more than one token. The serving code has to track partial matches across tokens.
If you’re using a serving framework (vLLM, TGI, SGLang) the stop-sequence handling is built in. If you’re rolling your own decoding loop, this is one of the parts to test carefully.
8.12 Why temperature 0 isn’t actually deterministic
Here’s a small interview-favorite gotcha: greedy decoding (or temperature 0) is theoretically deterministic — you always pick the argmax. In practice, the same model, the same prompt, the same hardware, and the same library version can produce different outputs across runs.
Three reasons:
(1) Floating-point matmul is non-associative. A matrix multiply on the GPU is computed as a sum of products. The order of summation depends on the kernel’s tiling strategy, the number of CUDA cores active, and sometimes scheduling decisions made at runtime. Different orderings produce slightly different floating-point sums. Most of the time the difference is invisible. Occasionally — when two logits are very close — the rounding tips the argmax to a different token. Once that happens, the trajectories diverge and the outputs are completely different.
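Non-associativity needs no GPU to demonstrate; the classic textbook example shows it in three additions:

```python
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False: same numbers, different summation order
```

A GPU matmul performs millions of such additions in a kernel-dependent order, so the logits it produces carry exactly this kind of last-bit noise.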
(2) Tensor parallelism aggregation order. When a model is split across GPUs with tensor parallelism (Chapter 28), each GPU computes a partial result and then they get summed via an all-reduce. The order of the all-reduce depends on the network topology and scheduling, and again, different orders produce different rounding. Models served at different parallelism degrees can produce different outputs from the “same” greedy decoding.
(3) Precision casting boundaries. Some operations (softmax, layer norm) get computed in fp32 even when the rest of the model is in bf16. The places where the precision changes have rounding boundaries that depend on subtle implementation choices.
The practical implication: “deterministic output for the same prompt” is a property of your serving stack, not of the model. To get true determinism you have to pin the kernel implementations, the parallelism configuration, the precision, the batch size, and a few other things. Most production systems give up on this and just accept that two temperature=0 calls can produce different results.
This is also why benchmark numbers from different vendors are hard to compare — even at temperature 0, two implementations of the same model can disagree on a few percent of MMLU questions.
8.13 Speculative and constrained decoding — forward pointers
Two important decoding techniques we’ll cover later that don’t fit in this chapter:
Speculative decoding (Chapter 27): use a small “draft” model to propose multiple tokens at once, then have the big model verify them in parallel. If the draft model is good, you get multiple tokens per big-model forward pass. The “guesses” can be wrong, in which case you fall back to the verified prefix. Production speedups of 2–3× are common.
Constrained / structured decoding (Chapter 43): force the model to emit output that matches a specified grammar (JSON schema, regex, SQL syntax). This is done by masking out invalid tokens at each step before sampling — the model can only sample from the set of tokens that are legal next characters in the grammar. Tools like Outlines, XGrammar, and JSON-mode in OpenAI all do this. The cost is small (a small number of token-set computations per step) and the benefit is enormous: you can get parseable JSON 100% of the time instead of 99%.
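The masking mechanism behind constrained decoding is the same `-inf` trick used by top-k and top-p, just driven by a grammar instead of probabilities. A minimal sketch — the legal-token set here is a made-up placeholder, not output from any real grammar engine:

```python
import torch

def mask_to_legal(logits, legal_token_ids):
    """Constrained decoding in one step: -inf out every token the grammar
    does not allow here, then sample from what remains as usual."""
    mask = torch.full_like(logits, float('-inf'))
    mask[legal_token_ids] = 0.0
    return logits + mask

torch.manual_seed(0)
logits = torch.randn(100)
legal = torch.tensor([7, 42, 99])        # hypothetical legal next tokens
probs = mask_to_legal(logits, legal).softmax(dim=-1)
# All probability mass now sits on the legal tokens; sampling cannot
# produce anything outside the grammar.
```

Real engines like Outlines and XGrammar compute the legal-token set from a compiled grammar automaton at each step; the masking itself is this simple.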
Both techniques modify the loop in §8.1 in interesting ways without changing its fundamental shape. We’ll get to them in Part III.
8.14 The mental model
Eight points to take into Chapter 9:
- Decoding is a sequential loop. You sample one token at a time, conditioned on everything generated so far.
- Prefill and decode are completely different workloads. Prefill is compute-bound, batchable, and dominates time-to-first-token. Decode is memory-bandwidth-bound, sequential, and dominates time-per-output-token.
- Temperature sharpens or flattens the distribution. It is the single most important sampling knob.
- Top-p (nucleus) is the production default. It adapts to the local distribution shape better than top-k.
- Beam search is dead for chat. It survives only in tasks with one correct answer (translation, captioning).
- Repetition penalties are necessary for small models and helpful for large ones. Use them in moderation.
- EOS and stop sequences are where serving bugs live. Use a serving framework or test carefully.
- Temperature 0 is not deterministic in practice. Floating-point non-associativity, tensor-parallelism order, and precision boundaries all conspire.
In Chapter 9 we shift from the generative side of transformers to the discriminative side: embeddings and rerankers, the encoder-only models that power retrieval and search.
Read it yourself
- Holtzman et al., The Curious Case of Neural Text Degeneration (2019) — the paper that introduced top-p and explained why beam search is wrong for open-ended generation.
- The HuggingFace `transformers` `GenerationConfig` docs — the practical reference for every sampling parameter.
- Andrej Karpathy’s minbpe video and the section in Let’s build GPT that walks through the decoding loop with code.
- The vLLM documentation on sampling parameters — read it once. It is the cleanest source on what each parameter does in production.
- The DRY sampler RFC and discussion threads in `text-generation-webui` — for the latest in repetition handling.
Practice
- Implement greedy, top-k, top-p, and min-p samplers in PyTorch from scratch in under 50 lines total.
- For a vocabulary of size 32000 and a model that produces logits where the top 5 are `[10, 9.9, 9.8, 5, 4]` and the rest are around 0, predict what `top_p(logits, p=0.9)` keeps. Verify by running it.
- Why does top-p adapt to the local distribution shape better than top-k? Construct two prompt situations where top-k has the wrong behavior and top-p has the right one.
- Why does beam search produce shorter and blander outputs for open-ended generation than sampling? Trace through the scoring on a toy two-step generation.
- A 70B model in bf16 has 140 GB of weights. An H100 has 3 TB/s of HBM bandwidth. Compute the lower bound on per-token latency in pure decode. (Answer: ~47 ms.) Now compute the same for a 7B model. Why is the 7B latency so much lower?
- Why does temperature 0 fail to be deterministic across runs? List three independent causes.
- Stretch: Write a stop-sequence handler that correctly buffers partial matches and only commits tokens to the output stream when they are no longer ambiguous. Test it on the stop sequence `"User:"` against a stream that contains `"User said"`.
Concept check
- 1. Prefill and decode have radically different performance profiles. The key reason is
- 2. Top-p (nucleus) sampling selects from the smallest set of tokens whose cumulative probability exceeds p. Compared to top-k, the main advantage is
- 3. Setting temperature to 0 should produce the same greedy output every time. In practice this often fails because
- 4. Beam search was the dominant decoding algorithm for seq2seq models but is rarely used for large LLMs. The main reason is