Part III · Inference Internals & Production Serving
Chapter 27 · Deep dive · ~19 min read

Speculative decoding: Medusa, EAGLE, MTP

"Decode is sequential, except when it isn't"

We established in Chapter 21 that decode is sequential — you can’t generate token t+1 until you’ve generated token t, because token t is part of the input that produces t+1. This sequentiality is the deepest constraint on inference speed: continuous batching gets around it by parallelizing across users, but within a single user’s generation, you’re locked into one token at a time, dominated by HBM bandwidth.

Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) is the technique that breaks this constraint, partially. The idea: use a small fast “draft” model to guess multiple tokens at once, then have the big “target” model verify them in parallel. If the guesses are correct, you’ve gotten multiple tokens for the cost of one big-model forward pass. If they’re wrong, you fall back to the verified prefix and try again. The expected speedup is roughly the average number of tokens accepted per target forward pass — typically 2–4× on real workloads.

This chapter covers the basic algorithm, the math of why it preserves the target model’s exact output distribution, and the modern variants that have refined the technique.

Outline:

  1. The sequential decoding bottleneck.
  2. The basic speculative decoding algorithm.
  3. Why it preserves the target distribution.
  4. Choosing a draft model.
  5. The acceptance rate, and what determines it.
  6. Self-speculative variants: Medusa, Lookahead.
  7. EAGLE, EAGLE-2, EAGLE-3.
  8. Multi-token prediction (MTP) and the DeepSeek approach.
  9. Operational considerations.
  10. When speculative decoding is not worth it.

27.1 The sequential decoding bottleneck

In a normal decoding loop, each step does:

  1. Run the big model on the latest context.
  2. Sample the next token from the resulting distribution.
  3. Append the token and repeat.

Each iteration produces exactly one new token for the cost of one big-model forward pass. The wall-clock per token is set by the big model’s decode latency, which is model_size / HBM_bandwidth for the memory-bound regime — about 47 ms/token for a 70B model on H100.

The key observation: the big model’s forward pass is barely doing any compute. Decode at AI ≈ 1 means the GPU is mostly idle, waiting for HBM. If you could give the big model a batch of several candidate next-tokens to evaluate in the same forward pass, the cost would be roughly the same (still memory-bound), but you’d get multiple tokens per pass.

The trick is figuring out which candidate next-tokens to give the model. The naive approach — try every possible vocab token — would require vocab_size candidates, which is too many. The clever approach is to let a smaller model guess the candidates, and have the big model verify them.

27.2 The basic algorithm

Speculative decoding has two models:

  • Target model: the big model whose distribution you want to sample from. The “real” model.
  • Draft model: a smaller, faster model that approximates the target. Typically 10–100× smaller. Could be a smaller version of the same family (Llama 3 8B drafting for Llama 3 70B), a distilled model, or a specifically trained draft head.

The algorithm, per iteration:

[Figure: one speculative decoding iteration (K=4). The draft model generates d₁–d₄ in 4 cheap sequential autoregressive steps; the target then verifies [ctx, d₁, d₂, d₃, d₄] in a single forward pass that computes p(·) at all 5 positions simultaneously. Here d₁–d₃ are accepted, d₄ is rejected and replaced by a sample from p_residual — net 3 tokens in roughly the time of one target forward pass.]
Speculative decoding runs the cheap draft model to produce candidates, then verifies all of them in a single target forward pass — each accepted token costs essentially nothing because the target was already reading its weights.
  1. Draft phase. The draft model generates K candidate tokens autoregressively (one token at a time, but cheap because the draft is small). Call them d_1, d_2, ..., d_K.
  2. Verify phase. The target model runs one forward pass with the input [context, d_1, d_2, ..., d_K]. That single pass computes the target distribution at all K+1 positions: one to check each draft token d_i, plus one after d_K that supplies a bonus token if everything is accepted.
  3. Accept/reject phase. Walk through the draft tokens left to right. For each d_i, compare its probability under the target distribution vs the draft distribution. If the target model “would have produced this token with at least the same likelihood as the draft model,” accept it. Otherwise, reject.
  4. Resample on rejection. When a draft token is rejected at position i, sample a new token from the adjusted target distribution at position i (specifically, the target distribution minus the rejected draft contribution, normalized).
  5. Continue. Append the accepted draft tokens (and the resampled token, if any) to the context, and repeat.

The number of accepted tokens per iteration is variable. In the best case all K are accepted plus one bonus from resampling — K+1 tokens for one big-model pass. In the worst case, zero are accepted plus one bonus — 1 token, same as regular decoding.
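The loop above can be sketched end to end with toy categorical distributions standing in for the per-position model outputs (a minimal illustration, not a serving implementation — in a real system p and q would come from the target’s and draft’s logits at each position):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target, q_draft, K, rng):
    """One speculative iteration over toy i.i.d. distributions.

    p_target, q_draft: fixed categorical distributions (stand-ins for
    per-position model outputs). Returns the list of emitted tokens.
    """
    # Draft phase: sample K candidate tokens from the draft distribution.
    drafts = rng.choice(len(q_draft), size=K, p=q_draft)
    accepted = []
    for d in drafts:
        # Accept d with probability min(1, p(d) / q(d)).
        if rng.random() < min(1.0, p_target[d] / q_draft[d]):
            accepted.append(int(d))
        else:
            # Rejected: resample from the residual max(0, p - q), normalized.
            residual = np.maximum(p_target - q_draft, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted  # stop at the first rejection
    # All K accepted: one bonus token sampled from the target's distribution
    # (in this toy setup every position shares the same p).
    accepted.append(int(rng.choice(len(p_target), p=p_target)))
    return accepted

p = np.array([0.6, 0.3, 0.1])   # target distribution
q = np.array([0.5, 0.4, 0.1])   # draft distribution
tokens = speculative_step(p, q, K=4, rng=rng)
print(tokens)  # between 1 and 5 tokens per target "forward pass"
```

Note that the rejected branch returns immediately: everything the draft guessed after the first rejected position is thrown away, which is exactly the wasted work that makes large K risky.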

27.3 Why it preserves the target distribution

The deep mathematical fact about speculative decoding: the resulting tokens are sampled from exactly the target model’s distribution, not from the draft’s. This is critical — you don’t want to use a worse model just because it’s faster. Speculative decoding is “free” speedup in the sense that the output distribution is identical to running the target model normally.

The proof is in the original Leviathan et al. paper. The key step is the rejection criterion. Let p be the target distribution and q be the draft distribution at a given position. The draft has sampled token x from q. We accept x with probability:

acceptance_prob = min(1, p(x) / q(x))

If p(x) ≥ q(x) (the target likes this token at least as much as the draft did), we always accept. If p(x) < q(x) (the target likes it less), we accept with probability p(x) / q(x).

When we reject, we resample from the residual distribution:

p_residual(y) = max(0, p(y) - q(y)) / Σ_z max(0, p(z) - q(z))

This is the part of the target distribution that the draft “missed.” Sampling from p_residual gives a token from the part of the distribution where p exceeds q, which is exactly what we need to make the overall acceptance distribution equal to p.

The math works out: the marginal distribution of the resulting token is exactly p, regardless of what q was. This is the magic. You can use any draft distribution; the output is still correct. The only thing that changes is the acceptance rate (and therefore the speedup).

This is why speculative decoding is “free”: the output is identical to running the target model normally, just faster on average.
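The guarantee is easy to check empirically. The sketch below (illustrative only) runs the accept/resample rule many times with a deliberately mismatched draft and confirms the empirical marginal matches p, not q:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
p = np.array([0.6, 0.3, 0.1])  # target distribution
q = np.array([0.2, 0.5, 0.3])  # deliberately mismatched draft

def accept_or_resample(p, q, rng):
    """Sample one token via the draft + accept/reject rule.
    The claim: the marginal of the returned token is exactly p."""
    x = rng.choice(len(q), p=q)                   # draft samples x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):      # accept w.p. min(1, p/q)
        return int(x)
    residual = np.maximum(p - q, 0.0)             # rejected: resample from
    residual /= residual.sum()                    # the normalized residual
    return int(rng.choice(len(residual), p=residual))

N = 200_000
counts = Counter(accept_or_resample(p, q, rng) for _ in range(N))
empirical = np.array([counts[i] / N for i in range(3)])
print(np.round(empirical, 3))  # close to [0.6, 0.3, 0.1] despite the bad draft
```

The draft here disagrees badly with the target (it prefers token 1; the target prefers token 0), yet the output frequencies land on p. Only the acceptance rate suffers.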

27.4 Choosing a draft model

The draft model needs three properties:

(1) Cheap to run. The draft has to be much faster than the target, otherwise the speedup disappears. Typical draft models are 10–100× smaller than the target. For Llama 3 70B as target, Llama 3 8B (or even smaller) is a common draft choice.

(2) High agreement with the target. The acceptance rate is determined by how often the draft’s q(x) is close to the target’s p(x). A draft that is wildly different from the target will have low acceptance and no speedup.

(3) Same vocabulary. The draft and target must use the same tokenizer so you can compare token-by-token probabilities. This means you can’t mix and match across model families easily — Llama 3 family drafts for Llama 3 family targets, Qwen family for Qwen family, etc.

The standard pairings:

  • Llama 3 70B target, Llama 3 8B draft: works well, because the 8B is from the same training data and architecture family. Acceptance rate ~70-80% on common workloads.
  • Llama 3 405B target, Llama 3 8B draft: also works, even larger speedup ratio.
  • Custom distilled draft: train a small model specifically to mimic the target. Higher acceptance rate, more setup cost. The Medusa and EAGLE approaches go further (see §27.6, §27.7).

The size ratio matters because of the expected speedup formula. If the draft is D times faster than the target, each iteration costs roughly one target pass plus K/D target-pass-equivalents of drafting, and yields L accepted tokens on average:

speedup ≈ L / (1 + K/D) ≈ L  (for D >> K)

You want a draft that’s much faster than the target and an L (acceptance length) that’s high. Both axes matter.
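A quick back-of-the-envelope helper makes the accounting concrete (the L=3, K=5, D=9 numbers are illustrative, not measured — D≈9 roughly matches a 70B/8B parameter ratio):

```python
def expected_speedup(L, K, D):
    """Expected speedup when the draft is D times faster than the target.

    One iteration costs 1 target pass plus K draft passes (each 1/D of a
    target pass) and yields L accepted tokens on average.
    """
    return L / (1.0 + K / D)

# Hypothetical 70B target with an 8B draft (~9x cheaper per pass),
# K=5 draft tokens, average accepted run length of 3 per iteration.
print(round(expected_speedup(L=3.0, K=5, D=9.0), 2))  # → 1.93
```

Notice how much the draft overhead costs: with an infinitely fast draft the same L=3 would give a 3× speedup, but a draft that is only ~9× cheaper drags it down to ~1.9×. This is why both axes matter.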

27.5 The acceptance rate

The acceptance rate is the fraction of draft tokens that the target accepts. It’s the key knob in speculative decoding’s performance, and it depends on:

  • How closely the draft matches the target. Higher = better.
  • The temperature. Lower temperature (greedier sampling) means the target is more peaked, and the draft is more likely to guess the peak token. Acceptance is highest at temperature 0.
  • The input distribution. Easy prompts (with predictable next tokens) have higher acceptance than hard prompts.
  • The number of speculation tokens K. The more you speculate, the more chances the draft has to be wrong somewhere in the chain. Acceptance rate per token decreases as K grows.

Typical numbers for Llama 3 70B with Llama 3 8B as draft:

  • Temperature 0 (greedy): acceptance rate ~75-85%, expected speedup ~2.5-3×.
  • Temperature 0.7: acceptance rate ~50-60%, expected speedup ~1.8-2.2×.
  • Temperature 1.0: acceptance rate ~35-45%, expected speedup ~1.5-1.8×.

The picture: speculative decoding is most useful at low temperature. At high temperature, the target distribution is so flat that the draft can’t predict it reliably, and the acceptance rate drops.

[Figure: expected speculative decoding speedup as a function of per-token acceptance rate at K=5 draft tokens — roughly 1.5× at 35% acceptance (T=1.0), 2.8× at 60% (T=0.7), and 3.5× at 80–90% (T=0).]
Acceptance rate is the key driver of speculative decoding speedup — low temperature (greedy decoding) produces near-deterministic distributions that draft models predict accurately, maximizing accepted tokens per pass.

The K hyperparameter is also tuned per workload. The optimal K is typically 4-7. Higher K means more tokens guessed per round, but lower per-token acceptance probability, and a longer wasted draft when something is rejected early.
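The K trade-off can be quantified with the i.i.d. acceptance model from the original paper: if each draft token is accepted independently with probability a, the expected tokens per iteration is (1 − a^(K+1)) / (1 − a). A small helper shows the diminishing returns:

```python
def tokens_per_iteration(a, K):
    """Expected tokens emitted per target pass, assuming each draft token
    is accepted independently with probability a (the simplified model
    from Leviathan et al.). Every iteration emits at least 1 token
    (either the resampled token or the bonus token)."""
    if a == 1.0:
        return K + 1
    return (1 - a ** (K + 1)) / (1 - a)

for K in (3, 5, 7):
    print(K, round(tokens_per_iteration(0.8, K), 2))
# → 3 2.95 / 5 3.69 / 7 4.16
```

At 80% per-token acceptance, going from K=5 to K=7 buys only ~0.5 extra tokens per iteration while adding two more sequential draft steps — which is why the sweet spot sits around K=4–7 and not higher.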

27.6 Self-speculative variants: Medusa and Lookahead

The original speculative decoding requires a separate draft model, which is operationally awkward (two models in memory, two loading paths, two sets of weights to manage). Self-speculative approaches eliminate the draft model by using the target model itself.

Medusa (Cai et al., 2023)

Add multiple “Medusa heads” to the target model. Each head is a small feed-forward layer on the final hidden state that predicts a token several positions ahead — the model’s regular LM head predicts the next token, Medusa head 1 predicts the token after that, head 2 the one after, and so on. The heads are trained on the target model’s outputs (a quick fine-tune of just the heads, leaving the base model frozen).

At inference, the target model produces multiple speculative tokens in one forward pass via the heads, and a verification pass (similar to standard speculative decoding) accepts or rejects them.

Medusa is operationally simpler than vanilla speculative decoding: only one model is loaded, the heads are tiny, and the integration with serving stacks is cleaner. The acceptance rate is lower than vanilla speculative decoding (because the heads are simple linear projections, not a full draft model), but the simplicity makes it competitive.

Speedups: ~2× on most workloads.
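A minimal sketch of the drafting step (all shapes are illustrative, and the plain random linear heads stand in for the paper’s trained residual-block heads):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab, num_heads = 64, 100, 4

# One weight matrix per Medusa head: each head projects the last hidden
# state straight to vocabulary logits for a different lookahead position.
head_weights = [rng.standard_normal((hidden, vocab)) * 0.02
                for _ in range(num_heads)]

def medusa_draft(h_last):
    """h_last: [hidden] hidden state at the last position of the target's
    single forward pass. Head i greedily guesses the token at position
    t+1+i — all heads read the same hidden state, so drafting is one
    matrix multiply per head, with no extra model forward passes."""
    return [int(np.argmax(h_last @ W)) for W in head_weights]

drafts = medusa_draft(rng.standard_normal(hidden))
print(drafts)  # 4 draft tokens, one per lookahead position
```

The structural point: unlike a draft model, the heads produce all their guesses from a single already-computed hidden state, which is why Medusa’s drafting overhead is nearly zero — and also why its per-token acceptance is lower.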

Lookahead Decoding (Fu et al., 2024)

Another self-speculative approach. The idea: the target model maintains a “Jacobi pool” of n-gram completions seen during decoding, and at each step, it tries to match the current generation against the pool. When a match is found, multiple tokens can be accepted at once.

Lookahead doesn’t require any training (no draft model, no Medusa heads). It works on any pretrained model out of the box. Speedups are smaller (~1.5×) but the operational simplicity is appealing.

27.7 EAGLE, EAGLE-2, EAGLE-3

EAGLE (Li et al., 2024) is the current state-of-the-art for speculative decoding. The key insight: instead of training the draft to predict the next token directly, train it to predict the target model’s hidden state.

[Figure: EAGLE architecture. The full target forward pass (70B) emits hidden state h_t and logits; a tiny EAGLE head (1 layer, ~50M parameters) consumes h_t and predicts ĥ_{t+1}, yielding draft tokens d₁, d₂, …, d_K, which the target then verifies in one forward pass (accept/reject).]
EAGLE predicts the target's next hidden state rather than the next token, giving the draft head rich representational signal — this is why EAGLE achieves 3–4× speedup versus vanilla speculative decoding's 2–2.5×.
The draft model is much smaller (a single transformer layer or small head), and it's fed the target's hidden states from previous tokens, plus the embedding of the current draft token.

The technique exploits two facts:

  1. The target’s hidden state contains much more information than the next-token prediction.
  2. A small model can learn to predict the target’s next hidden state fairly accurately if it has the previous hidden states as input.
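A toy sketch of the EAGLE drafting loop (all shapes and the single tanh “layer” are stand-ins; the real head is a trained transformer layer, and the embedding table and lm_head are shared with the target):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab = 64, 100

# Toy stand-ins for the frozen target's pieces (illustrative shapes only).
embed = rng.standard_normal((vocab, hidden)) * 0.02    # token embedding table
lm_head = rng.standard_normal((hidden, vocab)) * 0.02  # target's output head
W_eagle = rng.standard_normal((2 * hidden, hidden)) * 0.02  # tiny draft "layer"

def eagle_draft(h_t, token_t, steps):
    """Autoregressively predict future hidden states from (h_t, token_t).

    Each step concatenates the previous (predicted) hidden state with the
    current token's embedding, maps it through the small EAGLE layer to a
    predicted next hidden state, and reads a greedy draft token off via
    the target's own lm_head.
    """
    drafts = []
    h, tok = h_t, token_t
    for _ in range(steps):
        x = np.concatenate([h, embed[tok]])   # [prev hidden ; current embedding]
        h = np.tanh(x @ W_eagle)              # predicted next hidden state
        tok = int(np.argmax(h @ lm_head))     # greedy draft token at that step
        drafts.append(tok)
    return drafts

print(eagle_draft(rng.standard_normal(hidden), token_t=7, steps=4))
```

The key contrast with Medusa: the draft is still autoregressive (step i feeds step i+1), but it runs in hidden-state space through a ~50M-parameter layer rather than through a separate full model.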

EAGLE-2 made the draft tree dynamic: instead of speculating along a single chain of K draft tokens (or a fixed tree shape), it grows a context-aware tree of candidate continuations, guided by the draft’s confidence, and verifies the entire tree in parallel. This dramatically increases the chance that some path through the tree has high acceptance.

EAGLE-3 went further, fusing hidden states from multiple target layers rather than only the top layer and relaxing the explicit hidden-state-prediction objective, achieving 3–4× speedup on Llama 3 70B with very low overhead.

EAGLE is now the default speculative decoding method in vLLM and SGLang for production deployments where the speedup justifies the setup. The operational cost is a tiny extra model (the EAGLE head, ~50M parameters for a 70B target) that needs to be trained once.

27.8 Multi-token prediction (MTP) and DeepSeek

DeepSeek-V3 took a different angle: bake speculative decoding into the model itself by training the model to predict multiple future tokens at every position. This is Multi-token Prediction (MTP).

The training change: in addition to the standard next-token prediction loss, the model is trained with auxiliary losses that predict tokens further into the future. The model carries extra prediction heads, one per horizon. The main prediction (next token) is still the primary signal; the auxiliary losses are supplementary. (DeepSeek-V3 specifically uses a single sequential MTP module, predicting one token beyond the next.)
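A sketch of the objective, assuming a parallel multi-head framing with horizons 1–3 (an assumption for illustration; DeepSeek-V3’s actual MTP module is sequential, and the aux_weight value here is made up):

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy for [T, V] logits and [T] targets."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def mtp_loss(head_logits, tokens, aux_weight=0.3):
    """Illustrative MTP objective: head k (0-indexed) at position t predicts
    token t+1+k. head_logits: list of [T, V] arrays; tokens: [T] sequence.
    Head 0 is the ordinary next-token loss; the rest are auxiliary and
    down-weighted so they don't dominate the primary signal."""
    total = 0.0
    for k, logits in enumerate(head_logits):
        horizon = k + 1
        # Only positions that have a target `horizon` steps ahead contribute.
        valid = len(tokens) - horizon
        loss_k = cross_entropy(logits[:valid], tokens[horizon:])
        total += loss_k if k == 0 else aux_weight * loss_k
    return total

rng = np.random.default_rng(0)
T, V = 16, 50
tokens = rng.integers(0, V, size=T)
head_logits = [rng.standard_normal((T, V)) for _ in range(3)]  # horizons 1..3
print(round(float(mtp_loss(head_logits, tokens)), 3))
```

The only subtlety is the shifting: the horizon-k head has k fewer valid positions, since the last k tokens of the sequence have no target that far ahead.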

At inference, the auxiliary heads can be used to generate speculative tokens directly. No separate draft model is needed; the speculation is part of the model.

DeepSeek-V3 reports a 1.8× decoding speedup from MTP with negligible quality cost. The technique is gaining adoption in other models because it integrates so cleanly: there’s no separate inference path, no extra model to load, no acceptance/rejection complexity. Just sample from multiple output heads and verify with the standard model in parallel.

MTP is the cleanest version of speculative decoding I know about, and I expect more models to adopt it.

27.9 Operational considerations

If you’re going to use speculative decoding in production, here’s what you need to handle:

(1) Memory. The draft model takes additional GPU memory (for vanilla speculative decoding). For self-speculative methods (Medusa, EAGLE, MTP), the overhead is small (the heads are tiny).

(2) Throughput vs latency trade. Speculative decoding improves single-stream latency. But it can hurt aggregate throughput because the draft generation is sequential and adds overhead. In a heavily-batched serving scenario, where the GPU is already saturated by continuous batching, the speedup from speculative decoding is smaller. Speculative decoding is most useful at low concurrency, where the GPU has spare capacity that the draft model can productively use.

(3) Scheduler integration. Continuous batching schedulers have to know how to interleave draft tokens with main tokens. vLLM has first-class speculative decoding support since v0.4. SGLang and TensorRT-LLM also support it. The integration is non-trivial but mature.

(4) Sampling correctness. The mathematical guarantee is that speculative decoding preserves the target distribution. But it only does this if the implementation is correct — getting the rejection sampling right is tricky and there are several papers describing subtle bugs in early implementations. Use a well-tested library.

(5) When to use it. Speculative decoding is most valuable when:

  • You’re memory-bandwidth-bound (low concurrency, big model).
  • You’re at low temperature (high acceptance).
  • You can tolerate the extra complexity in the serving stack.

It’s least valuable when:

  • The GPU is already saturated by batching (no spare compute).
  • You’re at high temperature.
  • The workload has unpredictable next tokens that draft models can’t guess.

27.10 When speculative decoding is not worth it

Honest assessment: in highly-batched production serving, speculative decoding’s benefit is much smaller than the headline numbers suggest. The reason: continuous batching already uses the GPU’s spare compute by stacking many users together. There’s no extra capacity for the draft model to consume usefully — running the draft just slows down the existing batch.

The conditions where speculative decoding helps the most:

  • Single-stream interactive workloads (one user, one request, latency-critical).
  • Reasoning models that generate very long outputs and would benefit from any per-token speedup.
  • Low-traffic deployments where the GPU isn’t fully utilized.

Conversely, the conditions where it doesn’t help:

  • High-throughput batched serving where the GPU is at >70% utilization.
  • Diverse workloads where the draft model’s acceptance rate is low.
  • Very small models where the draft can’t be much smaller than the target.

For a high-traffic chat API serving 70B Llama, vanilla speculative decoding might give a 1.2× speedup. For a low-concurrency reasoning model deployment, it might give 2.5×. Tune your decision to your workload.

27.11 The mental model

Eight points to take into Chapter 28:

  1. Decode is sequential within a single request. Speculative decoding partially breaks this.
  2. The trick is having a small draft model guess multiple tokens, then verifying them in parallel with the target.
  3. The math preserves the target distribution exactly. It’s “free” speedup.
  4. The acceptance rate determines the speedup. Higher at low temperature, on predictable workloads.
  5. Self-speculative variants (Medusa, Lookahead, EAGLE, MTP) eliminate the separate draft model.
  6. EAGLE (especially EAGLE-2 and -3) is the current state-of-the-art for speculative decoding.
  7. MTP (multi-token prediction) bakes speculation into the model itself, as in DeepSeek-V3.
  8. Speculative decoding is most useful at low concurrency. For heavily-batched serving, the benefit is smaller.

In Chapter 28 we look at the parallelism dimensions for inference: tensor parallelism, pipeline parallelism, expert parallelism, sequence parallelism — applied at serving time rather than training time.


Read it yourself

  • Leviathan et al., Fast Inference from Transformers via Speculative Decoding (2023). The original paper.
  • Chen et al., Accelerating Large Language Model Decoding with Speculative Sampling (DeepMind, 2023). The independent parallel work.
  • Cai et al., Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (2023).
  • Li et al., EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (2024).
  • Fu et al., Lookahead Decoding (2024).
  • The DeepSeek-V3 technical report — the MTP section.
  • The vLLM speculative decoding documentation and source code.

Practice

  1. Walk through the speculative decoding algorithm step by step on a tiny example. Show the accept/reject logic.
  2. Why does speculative decoding preserve the target distribution exactly? Reproduce the proof from §27.3 with your own notation.
  3. Compute the expected speedup for speculative decoding with K=5 and acceptance rate per token of 80%. What about 50%?
  4. Why is the acceptance rate higher at low temperature? Trace through what happens to the target distribution as temperature drops.
  5. What’s the difference between Medusa and EAGLE? Compare the architectural choices.
  6. Why does speculative decoding help less in heavily-batched serving than in single-stream serving? Trace through the GPU utilization in both cases.
  7. Stretch: Implement a tiny vanilla speculative decoding loop in PyTorch using two open models (a 7B target and a 1B draft). Measure the speedup on a real text generation task.

Concept check

  1. What is the primary reason speculative decoding can yield multiple tokens per big-model forward pass without changing the output distribution?
  2. Why does speculative decoding provide the largest speedup in the memory-bandwidth-bound decode regime rather than in prefill?
  3. EAGLE differs from classic speculative decoding primarily because it drafts using which architectural approach?
  4. A team is deploying speculative decoding for a long-form document summarization workload where each request generates 4096 output tokens. Which factor most threatens a high acceptance rate?