Prefix caching, prompt caching, radix attention
"In production LLM serving, almost every request shares its first thousand tokens with another request. The KV cache for those tokens should be computed once"
In Chapter 22 we showed that the KV cache dominates inference memory, and in Chapter 24 we showed that PagedAttention enables the KV cache to be shared across requests via copy-on-write. In this chapter we close the loop: when many requests share a common prefix (a system prompt, a RAG context, a few-shot example block), we can share the KV cache for that prefix across all of them, computing it once and reusing it for every request that has the same prefix.
This is prefix caching, and it’s one of the largest production wins available to LLM serving in 2024-25. For workloads with long shared prefixes — basically every chat application — prefix caching reduces prefill time by 50–95% on cache hits and slashes overall serving cost.
This chapter covers the basic prefix caching idea, the SGLang/RadixAttention approach that makes it really shine, and the production patterns for cache management.
Outline:
- The shared-prefix observation.
- The basic prefix cache.
- Hashing and lookup.
- RadixAttention — SGLang’s tree-structured cache.
- Cross-batch and cross-request reuse.
- Eviction policies.
- Prefix cache + paged attention + continuous batching.
- The Anthropic prompt cache API and OpenAI’s prompt cache.
- When prefix caching doesn’t help.
- Operational considerations.
29.1 The shared-prefix observation
Look at any production chat or RAG workload. The prompts have structure:
[System prompt: 500 tokens — same for every request]
[RAG context: 2000 tokens — varies by query but common across users with same context]
[Few-shot examples: 500 tokens — sometimes shared across templates]
[User question: 100 tokens — unique per request]
Of those ~3000 prompt tokens, ~2900 are shared with at least one other recent request (the system prompt is shared with every request; the RAG context is shared with everyone querying the same document set; only the user question is truly unique).
Without prefix caching, you compute the KV cache for those 2900 tokens fresh for every single request. Each request prefills 3000 tokens, even though 97% of them are identical to the previous request. The wasted work is enormous.
With prefix caching, the engine notices that 2900 of the 3000 input tokens match the cached prefix from a previous request. The KV cache for those 2900 tokens is already in HBM. The engine just needs to:
- Reuse the cached blocks for the shared prefix.
- Compute the KV cache for only the 100 new tokens (the new user question).
- Continue generation as usual.
The savings: instead of prefilling 3000 tokens, you prefill 100 tokens. That’s a 30× reduction in prefill compute, which is essentially a 30× reduction in time-to-first-token for cache-hit requests.
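The arithmetic generalizes to a one-line model. Here is a back-of-envelope sketch, assuming prefill compute scales linearly with token count (`prefill_savings` is a hypothetical helper, not an engine API):

```python
# Back-of-envelope model: prefill work is proportional to the number of
# tokens that actually need to be prefilled (the uncached suffix).

def prefill_savings(total_prompt_tokens: int, cached_prefix_tokens: int) -> dict:
    """Return the uncached token count and the speedup factor on a cache hit."""
    uncached = total_prompt_tokens - cached_prefix_tokens
    return {
        "tokens_to_prefill": uncached,
        "speedup": total_prompt_tokens / uncached,
    }

stats = prefill_savings(total_prompt_tokens=3000, cached_prefix_tokens=2900)
print(stats)  # {'tokens_to_prefill': 100, 'speedup': 30.0}
```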
For RAG-heavy applications where the RAG context is the bulk of the prompt, 60–90% of incoming requests can be cache hits, and the aggregate prefill compute (and cost) drops by a comparable fraction.
This is why prefix caching is one of the most important production optimizations.
29.2 The basic prefix cache
The simplest prefix caching scheme:
- After every prefill, save the resulting KV cache blocks, indexed by the token sequence that produced them.
- When a new request arrives, check if its prompt prefix matches a cached sequence. If yes, reuse the matching prefix’s KV cache and only prefill the suffix.
- Evict old entries when the cache fills up (LRU is the default).
The data structure is a map from token sequences to KV cache blocks. The key is the sequence; the value is a list of (paged) KV blocks. When you look up a new request, you walk down the prompt one block at a time and check if the first block_size tokens match any cached entry, then the first 2 × block_size, and so on.
Because of PagedAttention’s block structure (Chapter 24), the prefix cache works at block granularity, not token granularity. A request that shares the first 47 tokens of a prompt with a cached entry only gets floor(47 / block_size) = 2 full blocks (32 tokens with block_size=16) reused; the partial third block doesn’t fully match, so its tokens are recomputed.
This is fine in practice because real workloads share prefixes much longer than the block boundary.
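The lookup scheme above can be sketched as a toy block-granular cache. This is illustrative Python, not vLLM's implementation: keys are tuples of all token IDs up to each block boundary, and the string values stand in for physical KV block handles.

```python
# Minimal block-granular prefix cache. A prefix is cached only at block
# boundaries, so a partial trailing block never produces a hit.

BLOCK_SIZE = 16

class PrefixCache:
    def __init__(self):
        # Map: tuple of all token IDs up to a block boundary -> KV block handle
        self.blocks: dict[tuple, object] = {}

    def store(self, tokens: list[int], handles: list[object]) -> None:
        for i in range(len(tokens) // BLOCK_SIZE):
            key = tuple(tokens[: (i + 1) * BLOCK_SIZE])
            self.blocks[key] = handles[i]

    def lookup(self, tokens: list[int]) -> list[object]:
        """Return handles for the longest fully-matching block-aligned prefix."""
        hits = []
        for i in range(len(tokens) // BLOCK_SIZE):
            key = tuple(tokens[: (i + 1) * BLOCK_SIZE])
            if key not in self.blocks:
                break  # the rest of the prompt must be prefilled
            hits.append(self.blocks[key])
        return hits

cache = PrefixCache()
cache.store(list(range(48)), ["blk0", "blk1", "blk2"])
# A request sharing only the first 47 tokens reuses 2 blocks (32 tokens):
print(len(cache.lookup(list(range(47)))))  # 2
```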
29.3 Hashing and lookup
How do you efficiently check “is this prefix cached?” Hashing.
Each cached prefix is keyed by a hash of its token sequence. When a new request arrives:
- Compute a rolling hash of the prompt token IDs.
- For each block boundary, check if the hash up to that point matches any cached entry.
- The longest matching prefix wins.
vLLM uses this approach. The hash is computed incrementally as the prompt is processed, and the cache lookup takes O(prompt_blocks) time. For a 3000-token prompt with block_size=16, that’s 187 block-hash lookups, completely negligible compared to the prefill compute.
The hash needs to be stable across processes so that cached entries from previous requests are findable. vLLM uses a deterministic hash that chains each block’s token IDs with the previous block’s hash, so identical tokens at different positions in a prompt hash to different values.
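A minimal version of this chained block hashing, with hypothetical helper names (vLLM's real logic lives in its block manager):

```python
# Each block's key hashes its own tokens together with the previous block's
# hash, so a block's identity depends on the entire prefix before it.
import hashlib

BLOCK_SIZE = 16

def block_hashes(token_ids: list[int]) -> list[str]:
    hashes, parent = [], ""
    n_full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # full blocks only
    for i in range(0, n_full, BLOCK_SIZE):
        block = token_ids[i : i + BLOCK_SIZE]
        h = hashlib.sha256((parent + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(h)
        parent = h  # chain: position matters, not just content
    return hashes

a = block_hashes(list(range(48)))
b = block_hashes(list(range(48)))
c = block_hashes([1] + list(range(47)))
print(a == b)        # True: deterministic across processes
print(a[0] == c[0])  # False: one differing token changes every hash after it
```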
29.4 RadixAttention — SGLang’s approach
SGLang (Zheng et al., 2023) introduced RadixAttention, which is a more sophisticated prefix cache based on a radix tree (a.k.a. compressed prefix trie).
The idea: instead of a flat hash map of cached prefixes, store them in a tree where each path from root to a node represents a cached prefix. When a new request arrives, walk down the tree until you can’t match any further; the path you walked is the longest cached prefix.
The tree structure has several advantages over a flat hash map:
(1) Multiple users sharing different prefixes are handled cleanly. If user A’s prompt shares the first 1000 tokens with user B’s prompt, but then they diverge, the tree has a single shared path for the first 1000 tokens and then branches. Both users hit the cache for the shared part.
(2) The branching is automatic. When a new prefix is added that diverges from an existing one mid-way, the tree splits a node and creates a new branch. You don’t have to explicitly manage the divergence.
(3) Cache hits are nested. A request that matches “the first 2000 tokens of cached entry X” automatically also matches “the first 1000 tokens of cached entry Y” if X and Y share the first 1000 tokens. The radix tree finds the longest match efficiently.
(4) Memory is shared at the block level via PagedAttention. The radix tree nodes point to physical KV blocks; multiple paths through the tree can point to the same blocks (the same block-pool sharing as in vanilla PagedAttention).
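A toy radix tree captures points (1) through (3): edge splitting on divergence and longest-prefix matching. Real RadixAttention nodes also point at KV blocks and carry reference counts; this sketch tracks only the token structure.

```python
# Each edge carries a run of token IDs. Inserting a prompt that diverges
# mid-edge splits the edge at the divergence point, creating a branch.

class Node:
    def __init__(self):
        self.children = {}  # first token of edge -> (edge_tokens, child node)

class RadixTree:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens):
        node = self.root
        while tokens:
            first = tokens[0]
            if first not in node.children:
                node.children[first] = (list(tokens), Node())
                return
            edge, child = node.children[first]
            i = 0
            while i < len(edge) and i < len(tokens) and edge[i] == tokens[i]:
                i += 1
            if i < len(edge):  # diverged mid-edge: split the node here
                mid = Node()
                mid.children[edge[i]] = (edge[i:], child)
                node.children[first] = (edge[:i], mid)
                child = mid
            node, tokens = child, tokens[i:]

    def match(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        while tokens and tokens[0] in node.children:
            edge, child = node.children[tokens[0]]
            i = 0
            while i < len(edge) and i < len(tokens) and edge[i] == tokens[i]:
                i += 1
            matched += i
            if i < len(edge):
                break
            node, tokens = child, tokens[i:]
        return matched

tree = RadixTree()
tree.insert([1, 2, 3, 4, 5, 6])     # user A's prompt
tree.insert([1, 2, 3, 9, 9])        # user B shares the first 3 tokens
print(tree.match([1, 2, 3, 4, 7]))  # 4: shared path, then A's branch, then miss
```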
SGLang’s RadixAttention is the fastest prefix-caching implementation as of late 2025. For workloads with deeply nested shared prefixes (think: a long system prompt + RAG context with multiple users querying the same set of documents), it can achieve cache hit rates above 90%.
The vLLM team has since adopted similar ideas, and the gap between vLLM and SGLang on prefix caching has narrowed significantly. Both are now production-quality.
29.5 Cross-batch and cross-request reuse
A subtle but important point: prefix caching works across requests over time, not just within a single batch.
Without caching, each request’s KV cache lives only as long as the request. When the request finishes, its cache is freed. The next request starts from scratch.
With caching, the KV cache persists in the cache pool after the request finishes (if there’s memory for it). The next request that has the same prefix can reuse the cached blocks.
This is the property that makes prefix caching so effective for real workloads. A typical chat application has a fixed system prompt that gets used by every request, indefinitely. Without cross-time reuse, you’d recompute that prefix for every request forever. With it, the system prompt’s KV cache lives in the cache from the moment the first request arrives until eviction (which should never happen if the cache is sized for the system prompt).
The cache is essentially a content-addressed key-value store for KV blocks. The “key” is the prefix sequence; the “value” is the (paged) KV cache. The size of this store determines how many distinct prefixes you can cache at once.
29.6 Eviction policies
When the cache is full and a new prefix needs to be added, something has to be evicted. The diagram below shows where eviction fits into the request lifecycle; the standard policies follow it.
graph LR
Req[New request arrives] --> Hash[Compute rolling hash of prompt blocks]
Hash --> Lookup{Cache hit?}
Lookup -->|hit| Reuse[Reuse KV blocks from cache]
Lookup -->|miss| Prefill[Prefill the suffix tokens]
Reuse --> Decode[Begin decode]
Prefill --> Store[Store new KV blocks in cache]
Store --> Decode
Decode --> Done[Return response]
Store --> Evict{Cache full?}
Evict -->|yes| LRU[Evict LRU blocks]
Evict -->|no| Done
style Reuse fill:var(--fig-accent-soft),stroke:var(--fig-accent)
On a full cache hit the prefix prefill is skipped entirely; on a partial hit only the uncached suffix is prefilled. Either way, TTFT shrinks by roughly the fraction of the prompt that was cached.
LRU (least recently used). Evict the prefix that hasn’t been used in the longest time. Simple, effective, and works well for typical workloads where recently-used prefixes are likely to be reused.
LFU (least frequently used). Evict the prefix that has been used the fewest times overall. Better for stable workloads with a fixed set of important prefixes.
Hybrid LRU+LFU. Combine recency and frequency signals. Used by some production caches.
Reference counting. Don’t evict prefixes that are currently being used by an in-flight request. This is necessary for correctness — you can’t free a block that’s actively being read.
vLLM uses LRU with reference counting. The eviction is triggered when a new request needs more cache space than is available.
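The policy vLLM uses can be sketched in a few lines: LRU order plus a reference count that pins blocks belonging to in-flight requests. Toy code with hypothetical names, not vLLM's classes.

```python
# LRU eviction with reference counting. Blocks pinned by running requests
# (ref_count > 0) are never evicted; among the rest, the least recently
# used goes first.
from collections import OrderedDict

class BlockPool:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> ref_count; order = recency

    def touch(self, block_id):
        self.blocks.move_to_end(block_id)  # mark as most recently used

    def add(self, block_id, ref_count=1):
        if len(self.blocks) >= self.capacity:
            self.evict_one()
        self.blocks[block_id] = ref_count

    def release(self, block_id):
        self.blocks[block_id] -= 1  # request finished; block stays cached

    def evict_one(self):
        for block_id, refs in self.blocks.items():  # oldest first
            if refs == 0:  # never free a block an in-flight request is reading
                del self.blocks[block_id]
                return block_id
        raise MemoryError("all cached blocks are pinned by in-flight requests")

pool = BlockPool(capacity=2)
pool.add("sys_prompt", ref_count=1)   # still used by a running request
pool.add("old_rag_ctx", ref_count=1)
pool.release("old_rag_ctx")           # its request finished
print(pool.evict_one())  # old_rag_ctx: LRU among the unpinned blocks
```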
29.7 Prefix cache + paged attention + continuous batching
The three techniques compose into the modern serving stack:
- PagedAttention (Chapter 24) provides the block-level KV cache storage that enables sharing.
- Continuous batching (Chapter 23) enables many requests to share a single forward pass.
- Prefix caching enables those requests to also share KV cache blocks for their common prefixes.
A single decode step in vLLM with all three:
- Schedule (continuous batching): pick which sequences to run this step. Each sequence may be at a different generation position.
- Prefix lookup (prefix caching): for newly-admitted sequences in this step, check if their prefix is in the cache. If yes, reuse the cached blocks instead of prefilling.
- Allocate blocks (PagedAttention): for tokens that aren’t cached, allocate physical blocks from the pool.
- Forward pass (FlashAttention + paged kernel): compute the attention against the per-sequence block tables, including the cache-hit blocks.
- Update cache: after the forward pass, the new K and V values for the suffix tokens are written into newly-allocated blocks. The shared prefix blocks are not modified (COW protects them).
- Sample: produce next token for each sequence.
- Free (eviction policy): free blocks for finished sequences. Some blocks may be retained in the prefix cache for future hits.
This is the production loop. Every modern serving stack does some version of this.
29.8 The Anthropic / OpenAI API-level cache
Two of the biggest API providers have made prefix caching a public feature:
Anthropic’s prompt caching (2024) lets API users mark portions of their prompt as “cacheable.” Anthropic’s serving infrastructure stores the KV cache for marked content for ~5 minutes (or longer with extended cache). Subsequent requests with the same cacheable prefix get a substantial discount on input token pricing — typically 90% off — and a faster response.
The user has to opt in by tagging the cacheable parts of the prompt with explicit cache_control markers on content blocks in the request body. This is an unusual API design (most APIs hide caching from the user), but it gives the user control over what to cache and gives Anthropic predictability about cache hit patterns.
OpenAI’s prompt caching (2024) works automatically: if the same prompt prefix is used twice within a few minutes, the second request gets a discount and faster response. No tagging required. Less control, but simpler to use.
Both use the same underlying technique as vLLM/SGLang: persistent block-level KV cache with prefix matching. The difference is the API contract — explicit (Anthropic) vs implicit (OpenAI).
For self-hosted serving, you get the same benefit “for free” through vLLM or SGLang as long as your workload has shared prefixes.
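For a concrete sense of the explicit contract, here is roughly what an opted-in Anthropic request body looks like. The shape follows Anthropic's prompt-caching docs as of 2024, and the model name is illustrative; check the current docs before relying on this.

```python
# Sketch of an Anthropic Messages API payload using prompt caching. The
# fixed system prompt carries a cache_control marker; the unique user
# question comes after the cacheable prefix.

payload = {
    "model": "claude-sonnet-4",  # illustrative model name, not an exact ID
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support assistant. (long fixed instructions)",
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    "messages": [
        {"role": "user", "content": "How do I reset my password?"}  # unique suffix
    ],
}
```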
29.9 When prefix caching doesn’t help
Prefix caching helps when prompts share long prefixes. It doesn’t help when:
- Every prompt is unique. Truly unique prompts (e.g., a search engine where every query is different) don’t benefit. Cache hit rate is 0%.
- Prompts share only short prefixes. If the prompts share fewer than block_size tokens, the cache lookup misses by default.
- The “shared” parts have small variations. Two prompts that look similar but differ in a few words don’t share a prefix unless the differences are at the very end. Even one different token at position 100 invalidates the cache for everything after position 100.
- The cache is too small. If the prefix cache is smaller than the working set of unique prefixes, eviction will kick in and hit rates will be low.
- Generation is the bulk of the work. If a request’s prefill is small relative to its decode (e.g., 100-token prompt → 2000-token response), the prefill savings are small in absolute terms.
For RAG, multi-tenant chat with system prompts, code completion (where the file context is shared across edits), and few-shot reasoning, prefix caching is enormously valuable. For pure search-engine-style workloads, it adds little.
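The decode-dominated case is worth quantifying. Under a rough latency model with illustrative per-token costs (not measurements), a full cache hit saves half the latency of a RAG-style request but almost none of a chat-completion one:

```python
# Rough end-to-end latency model: prefill cost per prompt token, decode cost
# per output token. The per-token numbers are illustrative assumptions.

def request_latency(prompt_tokens, cached_tokens, output_tokens,
                    prefill_ms_per_token=0.5, decode_ms_per_token=20.0):
    prefill = (prompt_tokens - cached_tokens) * prefill_ms_per_token
    decode = output_tokens * decode_ms_per_token
    return prefill + decode

# RAG-style: long prompt, short answer -> caching saves a lot
no_cache = request_latency(8000, 0, 200)
full_hit = request_latency(8000, 8000, 200)
print(round(1 - full_hit / no_cache, 2))  # 0.5

# Chat completion: short prompt, long answer -> caching barely matters
no_cache2 = request_latency(100, 0, 2000)
full_hit2 = request_latency(100, 100, 2000)
print(round(1 - full_hit2 / no_cache2, 4))  # 0.0012
```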
The leading-prefix-only constraint is sometimes a real annoyance. If the cacheable content is in the middle of the prompt (e.g., a fixed instruction comes after a user-specific intro), the cache won’t help. The fix is to reorder the prompt so the cacheable parts come first. This is one of the design tips you’ll see in prompt engineering guides for systems with prefix caching.
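Concretely, the reordering fix looks like this. The template builder is hypothetical; the ordering principle, most stable content first, is what matters.

```python
# Cache-friendly prompt assembly: stable content forms a long cacheable
# leading prefix, and the per-request part goes last.

def build_prompt(system: str, rag_context: str, examples: str, question: str) -> str:
    return "\n\n".join([
        system,       # identical for every request: always a cache hit
        rag_context,  # shared by everyone querying the same documents
        examples,     # shared per template
        question,     # unique: the only part that forces fresh prefill
    ])

# Anti-pattern: f"{question}\n\n{system}" puts the unique part first,
# so no two requests share a leading prefix and the cache never hits.
```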
29.10 Operational considerations
If you’re running vLLM or SGLang with prefix caching enabled, things to know:
(1) It’s enabled by default in current versions (vLLM’s V1 engine turns prefix caching on automatically; SGLang enables RadixAttention by default). On older vLLM releases it was opt-in via --enable-prefix-caching.
(2) The cache lives in GPU HBM, sharing the budget with the active request KV cache. There’s a trade-off: more space for the prefix cache means less space for active requests. Tune via the --gpu-memory-utilization flag in vLLM.
(3) Hit rate is the key metric. vLLM exposes prefix-cache query and hit counters through its Prometheus endpoint; their ratio gives the hit rate. A healthy production deployment should see >50% hit rate for chat applications.
(4) Cache hits dramatically reduce TTFT but not TPOT. If your latency budget is decode-dominated, prefix caching helps less. If it’s prefill-dominated (RAG, long contexts), prefix caching is huge.
(5) Evictions can cause latency variance. When a cache miss happens after a cache hit (because the relevant prefix was evicted), the request takes much longer than the usual case. This adds tail latency variance.
(6) Prefix caching interacts with chunked prefill. Modern vLLM does both: chunk the new (uncached) suffix into pieces and process them in chunked-prefill mode. The cached prefix is loaded directly from the cache without re-prefilling.
29.11 The mental model
Eight points to take into Chapter 30:
- Real workloads have huge shared prefixes. System prompts, RAG contexts, few-shot examples — all repeated across requests.
- Prefix caching reuses the KV cache for shared prefixes instead of recomputing it.
- Block-level caching via PagedAttention is the foundation. The prefix cache is a content-addressed map from token sequences to KV blocks.
- RadixAttention (SGLang) uses a radix tree for efficient nested prefix matching. vLLM has similar capabilities now.
- Cross-time reuse is the killer feature. The cache persists across requests, so common prefixes are computed once and reused indefinitely.
- LRU with reference counting is the standard eviction policy.
- Anthropic and OpenAI expose prompt caching via their APIs. Self-hosted serving gets it for free with vLLM or SGLang.
- Prefix caching helps prefill, not decode. It’s a TTFT optimization, not a TPOT optimization.
In Chapter 30 we shift from kernel-level optimizations to cost modeling: how to compute the actual dollar cost of serving an LLM.
Read it yourself
- Zheng et al., SGLang: Efficient Execution of Structured Language Model Programs (2023). The RadixAttention paper.
- The vLLM blog post on prefix caching.
- The Anthropic documentation on prompt caching.
- The OpenAI documentation on prompt caching.
- The vLLM source code, especially vllm/core/block_manager.py for the cache logic.
- The SGLang documentation on RadixAttention configuration.
Practice
- Compute the prefix cache savings for a chat application with a 1500-token system prompt, a 1000-token RAG context (shared across 10 users), and a 100-token user question. What’s the prefill compute saved on average?
- Why does prefix caching only work for the leading prefix, not arbitrary substrings? Trace through the KV cache structure.
- A new request shares the first 47 tokens with a cached prefix. With block_size=16, how many tokens does the cache reuse?
- Why does RadixAttention’s tree structure handle nested prefixes more efficiently than a flat hash map? Construct an example with three users sharing different overlapping prefixes.
- In vLLM, the prefix cache shares HBM with the active KV cache. If the cache is full, what happens when a new request needs space?
- Design a prompt template for a chat application that maximizes prefix cache hits. Where should you put the user’s input?
- Stretch: Run vLLM on a small model with prefix caching enabled. Send 100 requests with the same 500-token system prompt and varying user queries. Measure the TTFT for the first request vs the rest. Verify the cache hit speedup.
Concept check
- 1. What is the primary data structure that RadixAttention (SGLang) uses to allow many requests with partially overlapping prefixes to share cached KV entries?
- 2. Which eviction policy is most appropriate for a prefix cache that serves a chat application with a fixed system prompt shared by all users?
- 3. Prefix caching reduces prefill TTFT only on cache hits. Which workload pattern produces the highest cache hit rate?
- 4. A serving system uses prefix caching with PagedAttention. A new request partially matches a cached prefix but the last cached block contains tokens that differ from the new request. What does the system do?