PagedAttention and vLLM as a virtual-memory system for KV cache
"Every great systems idea is the same: replace contiguous allocation with paging, and the world gets better"
In Chapter 22 we showed that the KV cache dominates GPU memory in any non-trivial LLM deployment. In Chapter 23 we showed that continuous batching is the way to keep many users in flight on the same GPU. PagedAttention is the data structure that makes both of those work in practice. Without PagedAttention, the KV cache management problem in a continuous-batching scheduler is intractable. With it, vLLM became the dominant serving stack basically overnight.
This chapter goes deep on PagedAttention: what problem it solves, how it works, why the analogy to operating-system virtual memory is exact, and what features it enables (efficient batching with mixed sequence lengths, prefix sharing, copy-on-write).
Outline:
- The KV cache memory fragmentation problem.
- The OS analogy: physical memory, virtual memory, paging.
- PagedAttention — the basic idea.
- Block tables.
- The kernel side — how attention runs against blocked KV cache.
- Internal vs external fragmentation.
- Copy-on-write for prefix sharing.
- Eviction.
- Why PagedAttention is the foundation of modern LLM serving.
24.1 The fragmentation problem
Without PagedAttention, the KV cache for each in-flight sequence is stored as one contiguous tensor per layer that can hold the maximum sequence length the model supports. For Llama 3 70B with max_seq_len = 32768, each request reserves:
2 × 80 × 8 × 128 × 32768 × 2 bytes ≈ 10.7 GB per request
That’s 10.7 GB per request, regardless of how long the actual sequence is. A request with a 200-token prompt and a 100-token completion still gets 10.7 GB allocated to it, of which it uses about 0.1 GB. Roughly 99% of the allocated memory is wasted.
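The arithmetic is worth doing once by hand. A back-of-the-envelope sketch, using the Llama 3 70B numbers from the text (80 layers, 8 KV heads, head dimension 128, FP16):

```python
# Back-of-the-envelope KV cache sizing for the contiguous baseline.
# Model numbers are the text's Llama 3 70B: 80 layers, 8 KV heads,
# head dim 128, 2-byte (FP16) elements, max_seq_len = 32768.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 80, 8, 128, 2

def kv_bytes(tokens: int) -> int:
    """KV cache bytes for `tokens` tokens: K and V, all layers, all KV heads."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES * tokens

reserved = kv_bytes(32_768)   # contiguous pre-allocation per request
used = kv_bytes(300)          # 200-token prompt + 100-token completion
print(f"reserved: {reserved / 1e9:.1f} GB")  # ~10.7 GB
print(f"used:     {used / 1e9:.2f} GB")      # ~0.10 GB
print(f"wasted:   {100 * (1 - used / reserved):.1f}%")
```

The per-token cost (2 × 80 × 8 × 128 × 2 bytes = 320 KB) is the constant that recurs throughout this chapter.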
This is internal fragmentation in the OS sense: each allocation is bigger than what it actually needs. The fix in operating systems was paging, which lets you allocate memory in fixed-size pages and map them into a logical address space on demand.
PagedAttention applies the same idea to KV cache. Instead of one big contiguous tensor per request, the KV cache is broken into fixed-size blocks (typically 16 tokens per block), and each request allocates blocks as it grows.
The savings:
- A short request (300 tokens) allocates 19 blocks, ~95 MB instead of 10.7 GB. A ~100× saving.
- A long request (4000 tokens) allocates 250 blocks, ~1.3 GB instead of 10.7 GB. An ~8× saving.
- The total GPU memory available for KV cache is now usable across many concurrent requests instead of being locked up in over-provisioned contiguous tensors.
This is the core win. PagedAttention turns the KV cache budget from “how many max-length-pre-allocated requests fit” (very few) into “how many actually-used blocks fit” (many). Practical throughput improvements from PagedAttention alone are typically 2–4× on real workloads.
24.2 The OS analogy
The PagedAttention paper (Kwon et al., 2023) makes the analogy explicit. The mapping:
| Operating system | PagedAttention |
|---|---|
| Physical memory | GPU HBM |
| Process | Sequence (request) |
| Virtual address space | Logical KV cache for a sequence |
| Page | KV cache block (16 tokens) |
| Page table | Block table (per sequence) |
| Page fault | Allocate a new block when sequence grows |
| fork() (copy-on-write) | Beam search, prefix sharing |
| Swapping | KV cache offload to CPU/NVMe |
Read the table line by line. Every concept maps cleanly. This is not a loose metaphor — it’s a structural analogy. The same arguments that justify virtual memory in operating systems (fragmentation reduction, dynamic allocation, sharing via COW) justify PagedAttention in LLM serving.
The “block size” choice in PagedAttention is exactly the page-size choice in an OS: small blocks reduce internal fragmentation but increase metadata overhead, large blocks the opposite. The default block size in vLLM is 16 tokens. This is small enough that even short sequences don’t waste much, and large enough that the per-block bookkeeping is cheap.
24.3 PagedAttention — the basic idea
The core data structures:
Physical KV blocks. A pool of fixed-size blocks in GPU HBM. Each block holds the K and V vectors for block_size consecutive tokens, for all KV heads of one layer. For Llama 3 70B with block_size = 16:
Per block: 2 (K, V) × 16 tokens × 8 KV heads × 128 d_h × 2 bytes = 64 KB per layer
Total per block (all layers): 64 KB × 80 layers = 5,120 KB ≈ 5.2 MB
A 16 GB free KV budget on a GPU holds 16 GB / 5.2 MB ≈ 3,000 blocks. 3,000 blocks × 16 tokens per block = 48,000 tokens of KV cache total, distributable across however many requests want them.
Block tables. Each in-flight sequence has a per-sequence “block table” — a list of physical block IDs in the order they belong in the logical sequence. So sequence A’s block table is [7, 12, 3, 41, ...], meaning “the first 16 tokens of sequence A are in physical block 7, the next 16 are in physical block 12, …”
The logical view of the sequence is the concatenation of the blocks in block-table order. The physical layout in HBM is whatever the allocator chose; the blocks might be scattered all over.
When the sequence grows by one token, the engine checks if the current last block has room (there are positions left in it). If yes, append the new K and V to the existing block. If no, allocate a new block from the free pool, add it to the block table, and start filling it.
When a sequence finishes, its block table is freed and all its physical blocks return to the pool.
This is exactly how an OS manages a process’s address space.
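The grow-by-one-token logic above fits in a few lines. A toy sketch, not vLLM's actual allocator (the `BlockPool` and `Sequence` names are invented for illustration):

```python
class BlockPool:
    """Toy fixed-size block allocator (illustrative, not vLLM's code)."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("block pool exhausted -> evict or queue")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


class Sequence:
    """Per-sequence block table; grows one token at a time."""
    def __init__(self, pool: BlockPool, block_size: int = 16):
        self.pool, self.block_size = pool, block_size
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a fresh block only when the last one is full.
        if self.num_tokens % self.block_size == 0:
            self.block_table.append(self.pool.alloc())
        self.num_tokens += 1

    def free(self) -> None:
        # Sequence finished: all physical blocks return to the pool.
        for b in self.block_table:
            self.pool.release(b)
        self.block_table.clear()


pool = BlockPool(num_blocks=64)
seq = Sequence(pool, block_size=16)
for _ in range(300):             # the 300-token request from Section 24.1
    seq.append_token()
print(len(seq.block_table))      # 19 blocks, matching the earlier arithmetic
```

Note that the block table records allocation order, not physical adjacency: the pool hands out whatever block IDs are free.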
24.4 The block table in detail
Let’s make this concrete. Suppose block_size = 4 (small for the example) and we have a single sequence of 13 tokens.
Logical positions 0..12: [t0 t1 t2 t3 | t4 t5 t6 t7 | t8 t9 t10 t11 | t12]
Logical block index: 0 1 2 3
The sequence needs 4 blocks (the last one only has 1 token in it). Suppose the allocator gives us physical blocks 7, 12, 3, 41:
Block table: [7, 12, 3, 41]
Physical block 7 contains K, V for tokens 0..3
Physical block 12 contains K, V for tokens 4..7
Physical block 3 contains K, V for tokens 8..11
Physical block 41 contains K, V for token 12 (and 3 unused slots)
To compute attention for this sequence, the kernel needs to read K and V from positions 0..12. It uses the block table to translate logical positions to physical addresses:
- Logical position 5 → block table[5 / 4] = block table[1] = physical block 12, offset 5 % 4 = 1
- Logical position 10 → block table[10 / 4] = block table[2] = physical block 3, offset 10 % 4 = 2
The kernel reads each token’s K and V from the corresponding physical block. The reads are scattered across HBM (because the physical blocks aren’t contiguous), but the access pattern is predictable from the block table.
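The translation in the bullets above is nothing more than integer division and modulo. A minimal sketch using the Section 24.4 example:

```python
def translate(block_table: list[int], logical_pos: int, block_size: int = 4):
    """Map a logical token position to (physical_block, offset_in_block)."""
    return block_table[logical_pos // block_size], logical_pos % block_size

block_table = [7, 12, 3, 41]      # the 13-token example, block_size = 4
print(translate(block_table, 5))  # (12, 1)
print(translate(block_table, 10)) # (3, 2)
```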
This is paging applied to KV cache. The logical view is contiguous; the physical layout is not. The block table is the page table.
24.5 The kernel side
The PagedAttention paper had to write a custom CUDA kernel to make this efficient. A naive implementation of “read K and V from a list of arbitrary physical blocks” would have terrible memory access patterns — random reads across HBM are much slower than sequential reads.
The vLLM PagedAttention kernel does something clever. The key insight: even though the per-sequence access is scattered, the per-block access is contiguous. Each physical block holds 16 tokens worth of K and V in a tight packed layout. When the kernel reads a block, it reads the entire block sequentially.
The kernel structure:
- For each query token:
  - Look up the sequence’s block table to find which physical blocks hold its K and V context.
  - For each block in the sequence:
    - Load the block into shared memory (fast, sequential read).
    - Compute the attention scores between Q and the K’s in this block.
    - Compute the partial attention output: scores × V’s.
  - Combine the per-block partial outputs into the final attention output.
This is essentially the FlashAttention pattern (Chapter 25) extended to operate on a block-indexed KV cache. Each physical block is one “tile” of the FlashAttention computation. The math is the same; the memory access is via block tables.
Performance: the PagedAttention kernel achieves throughput within ~5% of FlashAttention on a contiguous KV cache, despite the indirection through the block table. The block-table lookup is small enough that it doesn’t bottleneck the kernel.
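The per-block combine can be modeled for a single query in plain Python. This is a toy sketch of the algorithm, not the CUDA kernel; the rescaling in the inner loop is the online-softmax trick that the FlashAttention pattern relies on:

```python
import math
import random

def attention(q, ks, vs):
    """Reference: softmax(q·k) weighted sum over V, contiguous KV."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    d = len(vs[0])
    return [sum(wi * v[j] for wi, v in zip(w, vs)) / z for j in range(d)]

def paged_attention(q, kv_blocks, block_table):
    """Same math, but KV lives in fixed-size blocks addressed by a block
    table. Per-block partials are merged with an online-softmax rescale."""
    m, z = -math.inf, 0.0
    acc = [0.0] * len(q)
    for block_id in block_table:          # one "tile" per physical block
        ks, vs = kv_blocks[block_id]
        for k, v in zip(ks, vs):
            s = sum(qi * ki for qi, ki in zip(q, k))
            m_new = max(m, s)
            scale = math.exp(m - m_new)   # rescale the running sums
            w = math.exp(s - m_new)
            z = z * scale + w
            acc = [a * scale + w * vj for a, vj in zip(acc, v)]
            m = m_new
    return [a / z for a in acc]

random.seed(0)
d = 4
ks = [[random.random() for _ in range(d)] for _ in range(6)]
vs = [[random.random() for _ in range(d)] for _ in range(6)]
q = [random.random() for _ in range(d)]
# Scatter the 6 tokens into non-contiguous physical blocks 5, 0, 3.
kv_blocks = {5: (ks[0:2], vs[0:2]), 0: (ks[2:4], vs[2:4]), 3: (ks[4:6], vs[4:6])}
out = paged_attention(q, kv_blocks, block_table=[5, 0, 3])
ref = attention(q, ks, vs)
assert all(abs(a - b) < 1e-9 for a, b in zip(out, ref))
```

The point of the check at the end: the physical scattering is invisible to the math. Only the memory access pattern changes.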
24.6 Internal vs external fragmentation
Two kinds of memory waste, both familiar from OS design:
Internal fragmentation is when an allocation is larger than what’s actually used. For PagedAttention, internal fragmentation happens at the last block of each sequence. If a sequence has 13 tokens and block_size = 16, the last block has 3 unused slots. On average, each sequence wastes block_size / 2 = 8 slots in its last block.
For 100 concurrent sequences with block_size = 16, the total internal fragmentation is 100 × 8 = 800 token slots, or about 800 × 320 KB = 256 MB. This is the cost of paging — you trade flexible allocation for some last-block waste.
The trade-off is the block size choice:
- Small block size (4 or 8 tokens) → less internal fragmentation, more block-table overhead, more kernel-launch overhead.
- Large block size (32 or 64 tokens) → more internal fragmentation, less metadata overhead.
vLLM defaults to 16, which is empirically the sweet spot.
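The trade-off can be tabulated with a quick script. Assumptions, taken from the text: ~320 KB of KV per token (Llama 3 70B, FP16), 100 concurrent sequences averaging 800 tokens, and block_size / 2 expected wasted slots per sequence:

```python
import math

# Expected internal fragmentation vs block-table size at several block sizes.
KV_PER_TOKEN_KB = 320   # Llama 3 70B, FP16, all layers (see Section 24.1)
NUM_SEQS = 100
AVG_LEN = 800

for block_size in (4, 8, 16, 32, 64):
    wasted_slots = NUM_SEQS * block_size // 2
    wasted_mb = wasted_slots * KV_PER_TOKEN_KB / 1000  # 1 MB = 1000 KB, as in the text
    table_entries = NUM_SEQS * math.ceil(AVG_LEN / block_size)
    print(f"block_size={block_size:2d}: ~{wasted_mb:6.0f} MB wasted, "
          f"{table_entries:5d} block-table entries")
```

At block_size = 16 this reproduces the 256 MB figure above; halving the block size halves the waste but doubles the metadata, which is the OS page-size trade-off restated.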
External fragmentation is the opposite problem: free memory is scattered in small pieces such that no contiguous allocation fits even though the total free memory is large. Paging eliminates external fragmentation entirely because all blocks are the same size — any free block can be used for any new allocation. This is one of the deepest wins of paged systems and one of the reasons PagedAttention is so much better than the contiguous-allocation alternative.
24.7 Copy-on-write for prefix sharing
Here’s where PagedAttention starts to shine. Suppose two sequences share a common prefix — for example, both start with the same 1000-token system prompt. With contiguous KV cache, you’d compute and store the prefix’s K and V twice, once per sequence. With PagedAttention, you can share the blocks.
The mechanism is copy-on-write (COW), the same idea Unix uses for fork():
- When the second sequence is created with the shared prefix, instead of copying the prefix’s KV cache, you make its block table point to the same physical blocks as the first sequence’s prefix.
- Both sequences now share the prefix’s KV cache. They each have their own block table, but the entries for the prefix point to the same physical blocks.
- Each block has a reference count. Shared blocks have refcount > 1.
- When one of the sequences wants to modify a block (e.g., it diverges from the prefix and starts generating its own tokens), the engine copies the block before modifying it. The original block is left alone (still referenced by the other sequence), and the modifying sequence gets a new block.
This is exactly how Unix fork() works: the parent and child share memory pages until one of them writes, at which point a copy is made.
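The refcount-plus-copy bookkeeping can be sketched in a few lines. A toy model, not vLLM's implementation (`CowBlockManager` and its methods are invented for illustration, and the KV payload copy is elided):

```python
class CowBlockManager:
    """Toy copy-on-write bookkeeping for shared KV blocks."""
    def __init__(self):
        self.refcount: dict[int, int] = {}
        self.next_id = 0

    def alloc(self) -> int:
        bid, self.next_id = self.next_id, self.next_id + 1
        self.refcount[bid] = 1
        return bid

    def fork(self, block_table: list[int]) -> list[int]:
        """Share a prefix: copy the table, not the blocks."""
        for bid in block_table:
            self.refcount[bid] += 1
        return list(block_table)

    def write(self, block_table: list[int], idx: int) -> None:
        """Before writing block idx: copy it iff it is shared."""
        bid = block_table[idx]
        if self.refcount[bid] > 1:
            self.refcount[bid] -= 1
            # A real engine would also copy the block's KV payload here.
            block_table[idx] = self.alloc()


mgr = CowBlockManager()
seq_a = [mgr.alloc() for _ in range(4)]  # 4 prefix blocks
seq_b = mgr.fork(seq_a)                  # shared: refcount 2 on each
mgr.write(seq_b, 3)                      # B diverges in its last block
print(seq_a[3] != seq_b[3])              # True: B got a private copy
print(mgr.refcount[seq_a[3]])            # 1: A's block is private again
```

Only the written block is copied; the first three prefix blocks stay shared at refcount 2, which is exactly the fork() behavior described above.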
The savings for prefix sharing are enormous in production. A typical chat application has:
- A 500-token system prompt (shared across all users)
- A 1000-token RAG context (often shared across requests in a session)
- A 200-token few-shot example block (sometimes shared)
If 100 concurrent users all share the same 500-token system prompt, the naive cost is 100 × 500 = 50,000 token slots of KV cache for the prompt alone. With PagedAttention COW, it’s 500 token slots total: the prefix is stored once, and every user’s block table points at the same physical blocks. A 100× reduction in prefix storage.
This is the foundation of prefix caching (Chapter 29), which extends the COW idea to a global cache: not just within a batch, but across batches and across time. We’ll get to it.
24.8 Beam search and other branching uses
Beam search (Chapter 8 — mostly dead for chat but alive in some tasks) needs to maintain multiple hypotheses in parallel, all branching from the same prefix. With contiguous KV cache, each beam copies the prefix’s cache, which is wasteful. With PagedAttention COW, each beam shares the prefix blocks and only copies them when its tokens diverge.
Same idea, same benefit. PagedAttention makes branching generation patterns (beam search, parallel sampling, tree-of-thoughts) cheap, where they’d be very expensive with contiguous cache.
This is one of the reasons modern serving stacks support advanced sampling strategies: the cost of branching is low because of the underlying paged memory model.
24.9 Eviction
When the block pool is full and a new request arrives, the engine has to either reject the request, queue it, or evict an existing sequence to make room. The eviction policy is part of the scheduler (Chapter 23).
vLLM’s default policy: drop and recompute. When a sequence is evicted:
- Free all its blocks back to the pool.
- Mark the sequence as “swapped out” but remember its token IDs.
- When the scheduler later decides to resume the sequence, allocate fresh blocks and re-prefill the entire sequence from scratch.
The cost of re-prefill is extra compute (one extra prefill per evicted sequence) but no quality loss — the sequence resumes at the same point. This is the simplest eviction strategy and works well when memory pressure is occasional.
The more sophisticated alternative is swapping: copy the evicted sequence’s blocks to CPU memory, free the GPU blocks, and copy them back when the sequence is resumed. This avoids re-prefill but has the cost of the GPU↔CPU memory transfer, which is slow over PCIe (Chapter 1’s PCIe vs HBM argument). Swapping is sometimes worth it for very long sequences where re-prefill cost would be huge.
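The "slow over PCIe" claim is easy to put numbers on. Illustrative assumptions, not measurements: ~320 KB of KV per token (Llama 3 70B, FP16) and ~25 GB/s of effective PCIe bandwidth:

```python
# Rough one-way swap time for an evicted sequence's KV cache over PCIe.
# Assumed numbers: ~320 KB of KV per token (Llama 3 70B, FP16) and
# ~25 GB/s effective PCIe bandwidth. Both are illustrative estimates.
KV_PER_TOKEN_BYTES = 320 * 1024
PCIE_BYTES_PER_S = 25e9

for tokens in (500, 4_000, 32_000):
    size_gb = tokens * KV_PER_TOKEN_BYTES / 1e9
    ms = tokens * KV_PER_TOKEN_BYTES / PCIE_BYTES_PER_S * 1e3
    print(f"{tokens:6d} tokens: {size_gb:5.2f} GB, ~{ms:6.1f} ms one way")
```

A max-length sequence costs hundreds of milliseconds each way, and the transfer happens twice (out on eviction, back on resume). Whether that beats re-prefill depends on the workload's prefill throughput, which is why vLLM exposes both policies rather than hardcoding one.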
vLLM supports both: the default is recompute, with optional swap.
24.10 Why PagedAttention is the foundation
Let’s step back and look at what PagedAttention enables:
```mermaid
graph LR
    PA[PagedAttention] --> CB[Continuous batching\nwith mixed lengths]
    PA --> PS[Prefix sharing\nvia COW]
    PA --> PC[Prefix caching\nacross batches]
    PA --> Evict[Granular eviction\nand swapping]
    PA --> Sched[Flexible per-block\nscheduling]
```
PagedAttention is the foundation: every feature of modern LLM serving that requires flexible KV cache management builds directly on its block-table abstraction.
- Continuous batching with mixed sequence lengths. Without paging, every sequence in the batch needs the same maximum length allocated. With paging, sequences allocate only what they need. This is what makes Chapter 23’s continuous batching memory-efficient.
- Prefix sharing via COW. Multiple sequences share the same prefix blocks until they diverge.
- Prefix caching across batches. A global cache of prefix blocks lets new requests skip prefill on common prefixes (Chapter 29).
- Efficient eviction and swapping. The granular block structure lets the scheduler free or move individual blocks instead of whole sequences.
- Flexible scheduling. The scheduler can make per-block decisions about what to keep, swap, or evict, which is much more nuanced than per-sequence decisions.
Every modern serving stack — vLLM, SGLang, TensorRT-LLM (via its --use_paged_context_fmha flag), TGI — uses some variant of paged KV cache. The PagedAttention paper essentially defined the production standard.
The “vLLM = state of the art” position rests on three things: PagedAttention (Chapter 24), continuous batching (Chapter 23), and the flag-by-flag tunability of vLLM that lets operators adjust scheduling, memory, and sampling behavior for their workload (Chapter 48). The first two are the heart; the third is the operational interface.
24.11 The mental model
Eight points to take into Chapter 25:
- Contiguous KV cache wastes 90%+ of allocated memory through internal fragmentation.
- PagedAttention applies OS-style paging to the KV cache. Fixed-size blocks, per-sequence block tables, dynamic allocation.
- Block size 16 is the empirical sweet spot. Smaller = more overhead, larger = more internal fragmentation.
- Block tables are page tables. The kernel translates logical positions to physical block addresses.
- External fragmentation is eliminated because all blocks are the same size.
- Copy-on-write enables prefix sharing. Multiple sequences share prefix blocks; copies are made only when one sequence writes.
- Eviction is by drop-and-recompute in vLLM’s default. Optional swap to CPU.
- PagedAttention is the foundation of every modern LLM serving stack. The paper and the code are required reading.
In Chapter 25 we look at the kernel-level optimization of attention itself: FlashAttention.
Read it yourself
- Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023). The vLLM paper. Read sections 3 and 4 carefully.
- The vLLM source code, especially `vllm/attention/backends/paged_attn.py` and the C++/CUDA kernel in `csrc/attention/`.
- The operating-systems textbook chapter on paging — any modern OS textbook (Tanenbaum, Silberschatz). The PagedAttention paper assumes you know it.
- The vLLM blog post “vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention” — the introductory writeup with the original benchmarks.
Practice
- Compute the internal fragmentation for a workload with 100 concurrent sequences, average length 800 tokens, and `block_size = 16`. What’s the wasted memory in MB?
- Why does PagedAttention eliminate external fragmentation? What property of fixed-size blocks makes this work?
- Walk through the COW logic for two sequences that share a 500-token prefix. How many physical blocks are needed in total? What happens when sequence A diverges from the prefix at token 501?
- Why does the PagedAttention CUDA kernel achieve performance close to FlashAttention on contiguous cache, despite the block table indirection? Trace through the memory access pattern.
- The vLLM scheduler uses drop-and-recompute as the default eviction strategy. Why not always swap to CPU instead? Argue both ways.
- Compute the maximum number of concurrent 4096-token sequences that fit in 16 GB of KV cache budget for Llama 3 70B with PagedAttention vs the contiguous baseline.
- Stretch: Read `vllm/csrc/attention/attention_kernels.cu` and identify the block table lookup logic. Draw the data flow from logical position → physical block → memory read.
Concept check
- 1. Without PagedAttention, why does pre-allocating the maximum sequence length per request waste so much memory?
- 2. PagedAttention borrows the concept of paging directly from operating system virtual memory. What is the mapping in the analogy?
- 3. How does PagedAttention's copy-on-write mechanism enable efficient prefix caching for requests that share a common prompt prefix?
- 4. PagedAttention reports 2 to 4x throughput improvement on real workloads over contiguous KV allocation. This improvement comes primarily from which effect?