Latency budgets, tail latency, and the p99 problem
"Your average latency is fine. Your p99 is killing you. They are not the same thing and they never will be."
In Chapter 30 we covered cost. In this chapter we cover the other axis: latency, and specifically the part that matters in production — the tail. Mean latency is what benchmarks show. p99 latency is what users feel and what wakes you up at 3 AM when the SLO breaks. The two diverge far more in LLM serving than in most systems you’ve worked with, because LLM requests are variable-cost in a way that classical web requests are not.
This chapter is about why tail latency is exotic for LLM serving, what determines it, and how to keep it bounded. By the end you’ll know:
- The latency taxonomy: TTFT, TPOT, e2e, queueing.
- Why LLM tail latency is much wider than web service tail latency.
- How continuous batching creates a specific kind of variance.
- How to set SLOs around variable-cost workloads.
- The interaction between latency, throughput, and cost.
Outline:
- The four latencies you must distinguish.
- The mean-vs-tail divergence.
- Why LLM tail latency is exotic.
- Sources of variance: queueing, batch composition, sequence length.
- Little’s Law applied to LLM serving.
- SLO design: per-percentile vs per-request.
- Mitigations: tier-based routing, batch caps, dedicated capacity.
- The ballooning generation problem.
- The “dropped request” tail.
- Tail latency in interview answers.
31.1 The four latencies
When we say “latency” for an LLM request, we usually mean one of four distinct things. Be precise.
(1) Time-to-first-token (TTFT). Wall-clock time from request submission until the first response token is streamed back. This is what the user feels first — the delay before “anything happens.” Dominated by prefill (Chapter 21) plus any queueing time before the request enters the model.
(2) Time-per-output-token (TPOT). Average wall-clock time between successive output tokens during streaming. Once generation starts, how fast is the model emitting tokens? Dominated by decode plus the cost of running attention against the growing KV cache.
(3) End-to-end latency (e2e). Total time from request submission until the last token is delivered. e2e ≈ TTFT + (output_length - 1) × TPOT. The metric your system test cares about.
(4) Queueing time. Time the request spends waiting in the scheduler’s queue before getting picked up. Zero when the system is underloaded; can dominate everything else when the system is overloaded.
Each of these has its own distribution, its own median, and its own tail. A serving stack that has good TTFT mean but bad TTFT p99 is a different problem from one with good TPOT mean but bad TPOT p99. The mitigations are different. The senior move is to track all four percentiles separately.
The user-facing “perceived response time” is some weighted combination: TTFT for “did the model start,” TPOT for “how fast is it streaming,” e2e for “when did I get my answer.” Different products weight these differently. A real-time chat needs low TTFT and TPOT. A document summarizer can tolerate higher TTFT in exchange for fast streaming. A batch classifier doesn’t care about TTFT at all and only cares about e2e.
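The decomposition in (3) is worth keeping as a one-liner when budgeting. A minimal sketch (the function name is illustrative, not from any serving stack):

```python
def e2e_ms(ttft_ms: float, tpot_ms: float, n_output_tokens: int,
           queue_ms: float = 0.0) -> float:
    """e2e ≈ queueing + TTFT + (n_out - 1) × TPOT.

    If your TTFT metric is measured from admission rather than submission,
    queueing must be added back in for an end-to-end budget.
    """
    return queue_ms + ttft_ms + (n_output_tokens - 1) * tpot_ms

# 200 ms TTFT, 50 ms TPOT, 200 output tokens, no queueing:
print(e2e_ms(200, 50, 200))  # → 10150.0 ms, i.e. ~10 s
```

Note that for long outputs the TPOT term dominates: at 200 tokens, TTFT is only 2% of the total.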
31.2 The mean-vs-tail divergence
For most web services, the tail latency is roughly 2–5× the median. p99 might be 5× p50. p999 might be 10× p50. These are wide but bounded.
For LLM serving, the tail is much wider. Common ratios:
| Metric | Median | p99 | p999 |
|---|---|---|---|
| TTFT | 200 ms | 800 ms | 3 s |
| TPOT | 50 ms | 150 ms | 500 ms |
| e2e (200-token output) | 10 s | 35 s | 2 minutes |
The p99/p50 ratio for e2e can easily be 5–10×, and the p999/p50 ratio can be 30×+. This is much wider than classical web services.
The reason is that LLM requests are variable-cost in two dimensions:
- Variable input length. A 100-token prompt and a 10000-token prompt have wildly different prefill costs. The latter needs roughly 100× more compute (more still in the quadratic attention term) and 100× more KV cache.
- Variable output length. A request that emits 5 tokens before EOS finishes much faster than one that emits 2000 tokens before hitting the max length. Two requests with the same prompt can produce wildly different output lengths.
Both axes contribute to the long tail. A request that gets a long prompt and generates a long output sits at the extreme tail of the distribution. There’s no clever scheduling that fixes this — the long-output request just takes a long time, period.
Compare to a web service handling uniform requests (e.g., “fetch this database row”): the tail is mostly about queueing and unrelated noise (GC pauses, network blips). The work itself is constant. LLM requests have variable work, and that variable work shows up directly in the latency tail.
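You can reproduce this divergence with a toy simulation: hold per-request work fixed for a "web-like" service, and let a long-tailed output length drive e2e for an "LLM-like" one. A sketch with made-up distribution parameters:

```python
import random


def percentile(xs, p):
    """Nearest-rank percentile of a sample (0 < p <= 100)."""
    xs = sorted(xs)
    k = min(len(xs) - 1, max(0, round(p / 100 * len(xs)) - 1))
    return xs[k]


random.seed(0)

# Web-like: fixed 100 ms of work plus small exponential queueing noise.
web = [100 + random.expovariate(1 / 10) for _ in range(100_000)]

# LLM-like: e2e ≈ TTFT + n_out × TPOT, with output length lognormal
# (median ~90 tokens, heavy right tail). TTFT = 200 ms, TPOT = 50 ms.
llm = [200 + 50 * random.lognormvariate(4.5, 1.0) for _ in range(100_000)]

web_ratio = percentile(web, 99) / percentile(web, 50)
llm_ratio = percentile(llm, 99) / percentile(llm, 50)
print(f"web p99/p50 ≈ {web_ratio:.1f}x, llm p99/p50 ≈ {llm_ratio:.1f}x")
```

With these (invented) parameters the web-like ratio lands well under 2×, while the LLM-like ratio lands near 10× — the variable work, not the noise, makes the tail.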
31.3 Why LLM tail latency is exotic
The variable-cost nature is the deepest reason, but there are several others:
(1) The continuous-batching effect. In continuous batching (Chapter 23), each step’s wall-clock time is determined by the longest sequence currently in the batch. If your request happens to be in the same batch as a sequence that’s already at 8000 tokens of context, every one of your decode steps takes longer than it would in an unloaded batch. Your latency depends on what else is in the batch.
(2) Prefill blocking. If chunked prefill isn’t enabled, a long-prompt request that arrives can block decode for the entire duration of its prefill. Existing decoding requests pause until the prefill completes. This adds occasional spikes to TPOT.
(3) KV cache pressure. When the KV cache fills, the scheduler may evict in-flight requests (drop-and-recompute). The evicted request gets restarted from scratch, adding huge latency. This is rare in healthy serving but can spike under load.
(4) Tokenizer overhead. For very short responses, the per-token tokenizer cost (which is constant) becomes a noticeable fraction of the total. For very long responses, it amortizes away.
(5) The “stuck in the middle” effect. In a heavily-loaded scheduler, a request might spend significant time queued before being admitted to the batch. This queueing time is invisible from the GPU’s perspective but very visible to the user.
Each of these adds to the tail in a different way. Mitigating tail latency means understanding which sources are driving it for your workload.
31.4 Sources of variance, in detail
Let me drill into each.
Sequence length variance
The cost of a forward pass scales with the sequence length (linearly for cached decode attention, quadratically for prefill attention). For a request with a 32k-token context, the attention portion of each decode step costs ~16× more than for one with a 2k-token context. Same model, same hardware.
If your workload has a wide range of context lengths, your TPOT distribution has a wide range too. The remedy is to route long-context requests to dedicated capacity so they don’t pollute the latency of short-context requests.
Batch composition variance
Continuous batching means a request’s per-step time depends on the maximum sequence length of the current batch. If you’re a 1k-token request and you happen to be in a batch with a 16k-token request, your per-step time is dominated by the 16k. Bad luck.
The remedy is to cap the maximum sequence in a batch so that no request can be slowed down by an outlier. vLLM has this knob.
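A toy cost model makes the "bad luck" concrete. Assume per-step decode time is a fixed weight-read cost plus an attention cost paced by the longest context in the batch (constants are invented for illustration; real step times also depend on total batch tokens):

```python
def step_time_ms(context_lens, base_ms=20.0, attn_us_per_token=2.0):
    # One decode step for the whole batch; every sequence waits for the
    # step, and the step is paced by the longest context present.
    return base_ms + attn_us_per_token * max(context_lens) / 1000.0


alone = step_time_ms([1_000])           # short request in its own batch
mixed = step_time_ms([1_000, 16_000])   # same request batched with a 16k one
print(alone, mixed)  # → 22.0 vs 52.0 — the short request's TPOT more than doubles
```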
Prefill blocking
Without chunked prefill, a new request’s prefill blocks the batch. With chunked prefill, the prefill is split into chunks that interleave with decode steps, so the impact is smoothed out. Always enable chunked prefill in production.
KV cache pressure and eviction
When memory is tight, the scheduler evicts. Evictions cause retries and add 1–10 seconds of tail latency. Symptoms: a few requests with e2e times 5–10× the median.
The remedy is to size the KV cache for peak load, with headroom. If you’re hitting frequent evictions, you’re underprovisioned.
Queueing
When admission to the batch is delayed because the batch is full, the request waits in a queue. Queue time is purely additive to TTFT.
The remedy is to add more replicas when queue lengths grow. Autoscaling on queue length is the right signal (Chapter 51).
31.5 Little’s Law applied
Little’s Law is the most useful systems-engineering theorem for thinking about latency under load. It says:
average_concurrency = arrival_rate × average_latency
Or equivalently:
average_latency = average_concurrency / arrival_rate
This relates three quantities: how many requests are in flight at any time (L), how many arrive per second (λ), and how long each takes on average (W). They’re related by L = λW. This is true for any stable system in steady state.
For LLM serving, the practical implications:
(1) Increasing concurrency increases latency if throughput is fixed. If your GPU sustains a fixed aggregate token throughput that gives 50 tokens/sec per user at 50 concurrent users, pushing 100 concurrent users through it halves each user’s share to 25 tokens/sec. Doubling concurrency at fixed aggregate throughput doubles per-user TPOT.
(2) The relationship isn’t linear at the limit. As concurrency grows toward the GPU’s capacity, the per-request latency grows nonlinearly. Above 80% utilization, latency starts to spike. Above 95%, latency is unbounded.
(3) There’s an optimal operating point. Run at 60–70% utilization for predictable latency. Above that, you’re trading latency for throughput in a bad ratio.
This is why you should not run your serving fleet at 90%+ utilization. The cost savings are small (10–20%) and the latency cost is huge. Aim for 70%.
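Plugging numbers into L = λW shows how the three quantities constrain each other. A sketch (the figures are illustrative):

```python
def avg_latency_s(concurrency, arrival_rate_per_s):
    """Little's Law rearranged: W = L / λ. Holds for any stable system
    in steady state -- no assumptions about the service distribution."""
    return concurrency / arrival_rate_per_s


def per_user_tokens_per_s(aggregate_tokens_per_s, concurrency):
    """At saturation the aggregate token rate is fixed; users split it."""
    return aggregate_tokens_per_s / concurrency


# 80 requests in flight with 10 arriving per second → 8 s average e2e.
print(avg_latency_s(80, 10))              # → 8.0

# 5000 tok/s aggregate: 50 users stream at 100 tok/s each, 100 users at 50.
print(per_user_tokens_per_s(5000, 50))    # → 100.0
print(per_user_tokens_per_s(5000, 100))   # → 50.0
```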
31.6 SLO design: per-percentile
The right way to set an SLO is per-percentile, not per-request:
- TTFT p50 < 500 ms (median user experience)
- TTFT p99 < 2 s (worst-case user experience)
- TPOT p50 < 80 ms (streaming feel for median)
- TPOT p99 < 200 ms (streaming feel for worst case)
- e2e p99 < 30 s for outputs up to 1000 tokens
Setting “all requests must be under 500ms” is wrong because LLM requests have a long tail by nature — that SLO will be violated every minute. Setting per-percentile SLOs acknowledges the tail and gives you a budget for it.
The SLO is then enforced through:
- Monitoring that tracks all the percentiles continuously.
- Autoscaling that adds capacity when latency degrades.
- Admission control that rejects requests rather than letting them enqueue beyond the SLO budget.
- Tier-based routing that gives premium-tier requests separate capacity from free-tier.
A serious production deployment has all four. A toy deployment has none and either over-provisions (expensive) or has unpredictable tail latency.
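The monitoring half of this reduces to a percentile computation over a rolling window plus a budget table. A minimal sketch (the budgets echo the list above; the helper names are hypothetical):

```python
def percentile(samples, p):
    """Nearest-rank percentile; good enough for SLO checks."""
    xs = sorted(samples)
    k = min(len(xs) - 1, max(0, round(p / 100 * len(xs)) - 1))
    return xs[k]


SLOS_MS = {  # (metric, percentile) -> budget in ms
    ("ttft", 50): 500, ("ttft", 99): 2_000,
    ("tpot", 50): 80,  ("tpot", 99): 200,
}


def slo_violations(window):
    """window: {metric: [latency_ms, ...]} over the evaluation period.
    Returns the budget-exceeding percentiles and their observed values."""
    return {(m, p): percentile(window[m], p)
            for (m, p), budget in SLOS_MS.items()
            if percentile(window[m], p) > budget}


# A window where the TTFT tail has blown through budget but the median hasn't:
window = {"ttft": [300] * 98 + [2_500] * 2, "tpot": [60] * 100}
print(slo_violations(window))  # → {('ttft', 99): 2500}
```

The point of returning per-percentile violations, rather than a single pass/fail, is that each one routes to a different mitigation in the next section.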
31.7 Mitigations
The toolbox for keeping tail latency bounded:
(1) Tier-based routing. Premium-tier requests run in batches with smaller max sequence lengths and tighter latency caps. Free-tier requests run in the maximally-batched, maximally-utilized batches. This lets you give “good latency for paying customers” without giving up the throughput on free traffic.
(2) Max sequence length per batch. Cap the maximum sequence in any single batch. Long sequences get their own (smaller) batch. This prevents short-request latency from being polluted by long-request batches.
(3) Chunked prefill (Chapter 23). Never let a long prefill block decode.
(4) Dedicated long-context capacity. Route requests with prompts >16k tokens to a separate fleet that’s sized for it. The main fleet stays focused on short-context requests.
(5) Headroom on KV cache. Don’t fill the cache to 100%. Keep 20% free so the scheduler has room for new requests without evicting.
(6) Aggressive autoscaling on queue length. When the queue grows, add capacity quickly. Don’t wait for the queue to clear naturally.
(7) Generation length caps. Set a hard maximum on output length. A request that generates 4000 tokens when 200 was reasonable is consuming resources for everyone else; cap it.
(8) Speculative decoding for low-concurrency. When the batch is small (low traffic), speculative decoding helps single-request latency at the cost of some compute. Useful for keeping latency bounded during off-peak.
The combination of these gets you predictable tail latency. None of them is sufficient alone.
```mermaid
graph TD
Start([High tail latency observed]) --> Q1{Which metric is elevated?}
Q1 -->|TTFT p99 high| Q2{Long queue times?}
Q1 -->|TPOT p99 high| Q3{Long context in batch?}
Q1 -->|e2e p999 very high| Q4{Ballooning generation?}
Q2 -->|Yes| Fix1[Add replicas / autoscale on queue length]
Q2 -->|No| Fix2[Enable chunked prefill to stop long prefills blocking decode]
Q3 -->|Yes| Fix3[Cap max sequence per batch; route long-context to dedicated fleet]
Q3 -->|No| Fix4[Check KV cache evictions — size cache for peak load]
Q4 -->|Yes| Fix5[Set max_tokens, stop sequences, server-side timeout]
Q4 -->|No| Fix6[Check dropped/evicted requests — increase KV cache headroom]
style Start fill:var(--fig-surface),stroke:var(--fig-border)
style Fix1 fill:var(--fig-surface),stroke:var(--fig-border)
style Fix2 fill:var(--fig-surface),stroke:var(--fig-border)
style Fix3 fill:var(--fig-accent-soft),stroke:var(--fig-accent)
style Fix4 fill:var(--fig-surface),stroke:var(--fig-border)
style Fix5 fill:var(--fig-surface),stroke:var(--fig-border)
style Fix6 fill:var(--fig-surface),stroke:var(--fig-border)
```
Diagnose tail latency by metric first, then cause — each path leads to a different mitigation; no single fix works for all cases.
31.8 The ballooning generation problem
A specific failure mode worth its own section: ballooning generation.
The model is supposed to emit EOS when it’s done. Sometimes it doesn’t — it just keeps going, producing text indefinitely until it hits the max_tokens limit. This can be a model bug, a sampling issue (high temperature with low repetition penalty causing loops), or a prompt issue (an open-ended prompt with no clear stopping point).
When this happens, a request that was supposed to take 5 seconds takes 60 seconds. If you have many such requests, your tail latency explodes.
Mitigations:
- Set `max_tokens` aggressively. For chat: 1000 or 2000. For specific tasks: tune to the task. Don’t let it default to 4096.
- Use stop sequences. If you know the model should stop after a specific token sequence, configure stop sequences in the API call.
- Detect repetition loops. Some serving stacks have detectors that abort generation when the model is generating the same token (or n-gram) repeatedly.
- Server-side timeouts. Hard cap the wall-clock time per request. After 60 seconds of generation, abort and return what you have.
These are mundane operational measures, but they’re what separates a serving stack that has bounded tail latency from one that has unbounded tail latency. Ballooning generation is one of the top three sources of LLM serving incidents.
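The repetition-loop detector in the third bullet can be as simple as an n-gram check over the recent token window. A heuristic sketch (the thresholds are arbitrary; production stacks tune them per workload):

```python
def looks_like_loop(token_ids, n=4, window=64, min_repeats=3):
    """Abort-generation heuristic: True when the trailing n-gram has already
    appeared min_repeats times in the recent window, suggesting the model
    is stuck repeating itself rather than progressing toward EOS."""
    if len(token_ids) < n * min_repeats:
        return False
    recent = token_ids[-window:]
    tail = tuple(recent[-n:])
    hits = sum(1 for i in range(len(recent) - n + 1)
               if tuple(recent[i:i + n]) == tail)
    return hits >= min_repeats


print(looks_like_loop([1, 2, 3, 4] * 10))   # → True (obvious 4-gram loop)
print(looks_like_loop(list(range(100))))    # → False (no repetition)
```

This runs per decode step on a short suffix, so its cost is negligible next to the forward pass it can save you from.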
31.9 The “dropped request” tail
Another tail-latency phenomenon: requests that are evicted mid-generation and have to be retried. We covered this in Chapter 22 (KV cache management) and Chapter 23 (continuous batching).
When the KV cache fills and the scheduler decides to evict, the evicted request loses its in-progress KV cache. vLLM’s default is “drop and recompute” — the request is requeued, and when it’s eventually re-scheduled, the entire prompt is re-prefilled from scratch. The total time for the request is now original_processing_time + queue_time + re-prefill_time + decode_time.
For a request that was 30% through its generation when evicted, this can mean a 2–3× increase in e2e latency. It shows up as a sharp tail.
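The 2–3× figure falls out of the additive formula above. A sketch with illustrative numbers (simplified: it assumes the decode restarts in full after re-prefill):

```python
def e2e_with_eviction_s(prefill_s, decode_s, evicted_at_frac, queue_s):
    """Drop-and-recompute: the time spent before the eviction is wasted,
    then the request re-queues, re-prefills, and decodes again."""
    wasted = prefill_s + evicted_at_frac * decode_s
    return wasted + queue_s + prefill_s + decode_s


clean = 1.0 + 10.0                                # 1 s prefill + 10 s decode
evicted = e2e_with_eviction_s(1.0, 10.0, 0.3, 8.0)  # evicted 30% through
print(f"{evicted / clean:.1f}x")  # → 2.1x the clean e2e
```

The queue time is the variable here: under the same load that caused the eviction, the requeued request waits behind everyone else, which is why evictions cluster at exactly the wrong moment.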
Mitigations:
- Size the KV cache for the workload. Evictions should be rare; if they’re common, you’re underprovisioned.
- Fair scheduling. Don’t always evict the same long-context requests; rotate.
- CPU offload as fallback. Instead of dropping, copy the KV cache to CPU memory and resume later. Slower than running but faster than re-prefill.
In a healthy production deployment, evictions are <0.1% of requests. If you see more than that, fix the provisioning.
31.10 Tail latency in interviews
The senior interview move on tail latency is to:
- Distinguish the four latencies (TTFT, TPOT, e2e, queueing).
- Explain why LLM latency has a long tail (variable-cost requests, batch composition variance).
- Cite Little’s Law when discussing concurrency vs latency.
- Talk about per-percentile SLOs, not blanket “must be under X.”
- Mention the specific mitigations (tier-based routing, max-seq caps, chunked prefill, generation caps, autoscaling).
- Acknowledge the throughput-latency trade-off explicitly.
- Bring up ballooning generation as a top failure mode.
A candidate who can articulate this is operating at a senior systems level. A candidate who only talks about “average latency” is mid-level.
31.11 The mental model
Eight points to take into Chapter 32:
- Four latencies: TTFT, TPOT, e2e, queueing. Each has its own distribution.
- LLM tail latency is wide because requests are variable-cost in input length and output length.
- Continuous batching creates batch-composition variance. Your latency depends on what else is in the batch.
- Little’s Law: `L = λW`. Increasing concurrency at fixed throughput increases latency.
- Run at 60–70% utilization. Above that, latency spikes nonlinearly.
- Set per-percentile SLOs. A “must be under X” SLO is wrong.
- Tier-based routing, max-seq caps, chunked prefill, generation caps. The mitigation toolbox.
- Ballooning generation is the most common avoidable cause of tail latency.
In Chapter 32 we wrap up Stage 2 of Part III with multimodal: how vision-language and audio models change the inference picture.
Read it yourself
- The Tail at Scale (Dean & Barroso, 2013). The classic paper on tail latency in large systems. Read it for the framework.
- Little’s Law (Little, 1961). The short paper that launched a thousand queueing-theory chapters.
- The vLLM benchmarking documentation, which reports per-percentile latency.
- The Anyscale blog post on continuous batching, which discusses the latency variance trade.
- Brendan Gregg’s Systems Performance book, chapter on latency. Not LLM-specific but broadly applicable.
Practice
- For a 70B model on a single H100, with 50 concurrent users at 50 tokens/sec/user, compute the steady-state TPOT. Use Little’s Law.
- Why does running at 90% utilization give much higher tail latency than running at 70%? Plot the queueing curve.
- A request has TTFT p50 = 200ms and TTFT p99 = 1.5s. What does this tell you about the source of variance? How would you debug it?
- Design a per-percentile SLO suite for a chat application with 70B Llama. Cover TTFT, TPOT, and e2e.
- Why does tier-based routing reduce tail latency for premium tier? Trace through the batch composition logic.
- A user reports “the chatbot sometimes takes 60 seconds to respond.” You check the logs and see e2e p999 = 60s. What’s likely happening, and what do you check first?
- Stretch: Run vLLM with a mixed workload of short and long requests. Measure TPOT separately for each group. Verify that the long-request batch composition affects short-request latency.
Concept check
- 1. A production chat service has good mean TTFT but very high p99 TTFT. Which root cause most likely explains this pattern?
- 2. Little's Law states L = lambda * W. In the context of LLM serving, what do L, lambda, and W represent?
- 3. Why does end-to-end latency variance in LLM serving grow much wider than in a typical web service under similar load?
- 4. A team sets their TTFT SLO at p95 less than 1s and their TPOT SLO at p99 less than 50ms. A new workload introduces 10% of requests with 8k-token prompts. Which SLO is most at risk and why?