KV cache compression and offload: LMCache, RDMA, NVMe tiering
"When the cache doesn't fit in HBM, the question becomes: can you afford to recompute it, or can you afford to fetch it?"
In Chapter 22 we showed that the KV cache dominates GPU memory in any non-trivial LLM deployment. In Chapters 24 (PagedAttention), 26 (KV cache quantization), and 33 (GQA/MLA), we saw techniques to reduce the per-token cache size or use the available memory more efficiently. This chapter is about what happens when even those aren’t enough — when you have a workload where the cache simply doesn’t fit in HBM and you need to store some of it somewhere else.
The “somewhere else” forms a hierarchy: GPU HBM → CPU DRAM → local NVMe SSD → remote storage over the network (RDMA, S3). Each tier is slower than the one above and the cost of fetching from it has to be weighed against the alternative of recomputing the cache from scratch.
This chapter covers the offload techniques, the LMCache project (which is the most prominent open-source implementation), and the trade-offs that determine when offload helps.
Outline:
- The memory hierarchy revisited.
- The recompute vs fetch trade.
- CPU offload — the simplest tier.
- NVMe offload.
- RDMA-based remote KV cache.
- LMCache — the open-source implementation.
- Cross-replica KV sharing.
- The KV-cache-as-a-service vision.
- When offload helps and when it doesn’t.
37.1 The memory hierarchy revisited
The latency and bandwidth of each storage tier:
| Tier | Latency | Bandwidth | Cost per GB |
|---|---|---|---|
| GPU HBM (H100) | ~0.1 μs | 3 TB/s | very high |
| GPU HBM (H200) | ~0.1 μs | 4.8 TB/s | very high |
| CPU DRAM | ~80 ns | ~400 GB/s (per socket) | high |
| Local NVMe SSD | ~100 μs | ~7 GB/s (Gen5) | medium |
| Remote (RDMA over IB) | ~5 μs | ~25 GB/s | varies |
| Remote (S3, network) | ~10 ms | ~1 GB/s | low |
Bandwidth drops by roughly an order of magnitude (and at some steps far more) with each tier down, and latency grows by several orders of magnitude between DRAM and S3. The right approach is the classical one: store hot data in fast tiers and cold data in slow tiers — a caching hierarchy.
For LLM serving, the question is: what counts as hot KV cache? And the answer depends entirely on the access pattern. The cache for an in-flight request is very hot (read every step). The cache for a prefix that might be reused in the next minute is warmer (read on every cache hit). The cache for a prefix that hasn’t been seen in an hour is cold (might be read again, might not).
Each tier has a different sweet spot:
- HBM: in-flight request KV cache, hot prefix cache.
- CPU DRAM: warm prefix cache, evicted in-flight requests during transient memory pressure.
- NVMe: cold prefix cache, long-term storage of frequently-reused prefixes (system prompts, RAG documents).
- Remote: very cold cache, cross-replica sharing for global hit rates.
37.2 The recompute vs fetch trade
Here’s the framing question for any offload decision:
Is fetching the KV cache from a slower tier cheaper than recomputing it from scratch?
The cost of recomputing is the prefill cost, which is roughly:
recompute_cost ≈ (S_prefix × per_token_compute_cost)
The cost of fetching is:
fetch_cost ≈ (S_prefix × KV_per_token / fetch_bandwidth + fetch_latency)
For a 1000-token prefix on Llama 3 70B:
- KV cache size: ~320 MB.
- Recompute cost: ~0.3 seconds of prefill compute on H100.
- Fetch from CPU DRAM: ~10 ms — the copy is bounded by PCIe bandwidth rather than DRAM bandwidth, but still negligible next to recompute.
- Fetch from local NVMe: ~50 ms (320 MB / 7 GB/s).
- Fetch from RDMA: ~15 ms (320 MB / 25 GB/s).
- Fetch from S3: ~320 ms (320 MB / 1 GB/s).
The CPU and RDMA tiers are clearly worth using — fetching costs a small fraction of recompute. NVMe also wins at these numbers (~50 ms vs ~300 ms), though the absolute saving per hit is modest for a short prefix. S3 roughly breaks even: at ~0.33 ms per token fetched vs ~0.3 ms per token recomputed, it only pays off when recompute grows superlinearly with context length or the GPU compute is needed elsewhere.
For a 10,000-token prefix:
- Recompute: ~3 seconds.
- Fetch from local NVMe: ~500 ms.
- Fetch from RDMA: ~150 ms.
The absolute savings grow with prefix length: at 10k tokens, NVMe saves ~2.5 seconds per hit. And because prefill compute eventually grows superlinearly with context (the attention term is quadratic in sequence length) while fetch cost stays linear in cache size, the slower tiers get relatively better as prefixes grow. For very long contexts (32k+ tokens), even NVMe is several times faster than recomputing.
The general rule: the longer the prefix, the more attractive offload is. For short prefixes, recompute might be just as fast.
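This decision reduces to a few lines of arithmetic. The sketch below encodes it using the chapter's rough numbers for Llama 3 70B on an H100 — the constants are illustrative assumptions, not measurements:

```python
# Sketch of the recompute-vs-fetch decision. All constants are
# illustrative assumptions for Llama 3 70B on an H100.

KV_BYTES_PER_TOKEN = 320 * 1024        # ~320 KB of KV cache per token
RECOMPUTE_SEC_PER_TOKEN = 0.3 / 1000   # ~0.3 s of prefill per 1000 tokens

TIERS = {                               # tier -> (bandwidth B/s, latency s)
    "cpu_dram": (400e9, 100e-9),
    "nvme":     (7e9,   100e-6),
    "rdma":     (25e9,  5e-6),
    "s3":       (1e9,   10e-3),
}

def recompute_cost(prefix_tokens: int) -> float:
    """Prefill cost: linear in prefix length under this simple model."""
    return prefix_tokens * RECOMPUTE_SEC_PER_TOKEN

def fetch_cost(prefix_tokens: int, tier: str) -> float:
    """Transfer time plus the tier's fixed access latency."""
    bandwidth, latency = TIERS[tier]
    return prefix_tokens * KV_BYTES_PER_TOKEN / bandwidth + latency

def should_fetch(prefix_tokens: int, tier: str) -> bool:
    return fetch_cost(prefix_tokens, tier) < recompute_cost(prefix_tokens)
```

Note what the linear model hides: S3's per-token fetch cost (~0.33 ms) slightly exceeds the per-token recompute cost (~0.3 ms), so under this model S3 never quite wins — it only becomes attractive once attention's quadratic term makes long-prefix recompute superlinear.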
37.3 CPU offload
The simplest tier: swap KV cache to CPU DRAM when GPU memory is full.
The mechanism: when the scheduler decides to evict an in-flight request (Chapter 22), instead of dropping the cache entirely, it copies the cache to CPU DRAM via PCIe. When the request is later resumed, the cache is copied back to HBM.
The cost: the transfer goes over PCIe, which is slow compared to HBM. Large contiguous copies reach tens of GB/s, but a paged cache copied block by block can degrade to a few hundred MB/s. For a 320 MB KV cache that means anywhere from ~10 ms (batched) to ~1 second (fragmented) — which is why implementations batch block transfers, and why swapping pays off mainly for caches long enough that recompute would cost even more.
vLLM supports CPU offload via the swap_space configuration. The default swap_space is 4 GiB per GPU; you can configure it higher if you have CPU memory to spare. When the cache pool is exhausted, vLLM swaps low-priority requests’ KV cache to CPU instead of dropping them.
CPU offload is mostly a safety valve for transient memory pressure. It’s not a “make the system faster” optimization; it’s a “prevent the system from rejecting requests” feature.
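In vLLM this is a one-flag change via the swap_space setting described above; a sketch of the server form (the flag spelling and the model name here are illustrative and may vary across vLLM versions):

```shell
# Reserve 16 GiB of CPU swap space per GPU (default: 4 GiB).
# When the GPU KV pool is exhausted, preempted requests' caches are
# swapped to CPU DRAM over PCIe instead of being dropped and recomputed.
vllm serve meta-llama/Llama-3.1-70B-Instruct --swap-space 16
```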
37.4 NVMe offload
The next tier down: store cold KV cache on local NVMe SSDs. NVMe is roughly 10× slower than CPU DRAM but 10× cheaper per GB and available in much larger capacities (TB-scale per node).
The use case: persistent prefix cache for frequently-reused prefixes. System prompts, RAG documents that are queried over and over, common few-shot examples — these are cached on NVMe and reloaded into HBM when needed.
The setup is similar to CPU offload but slower:
- The serving stack writes cold cache blocks to a file on local NVMe.
- An indexing structure (similar to the prefix cache) tracks which blocks are on disk vs in HBM.
- When a request hits a prefix that’s only on disk, the relevant blocks are read from NVMe and copied into HBM over PCIe.
The latency cost is significant — 50-500 ms per request for an NVMe load — but it’s still faster than recomputing for long prefixes. For a 30k-token prefix, NVMe load takes ~1 second vs ~10 seconds of recompute.
NVMe offload is most valuable for long-prefix workloads with lots of reuse. For short-prefix or low-reuse workloads, the operational cost isn’t justified.
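The index-plus-file structure above can be sketched as a toy block store. All names here are hypothetical, and a real implementation would use pinned buffers, async I/O, and a binary block format rather than pickle — this only shows the content-addressed file layout:

```python
import hashlib
import os
import pickle

class NvmeBlockStore:
    """Toy file-backed KV block store: one file per cache block,
    keyed by a hash of the token sequence the block covers."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, token_ids: tuple) -> str:
        # Content addressing: the filename is derived from the tokens,
        # so any process can locate a block without a shared index.
        key = hashlib.sha256(repr(token_ids).encode()).hexdigest()
        return os.path.join(self.root, key)

    def put(self, token_ids: tuple, kv_block) -> None:
        with open(self._path(token_ids), "wb") as f:
            pickle.dump(kv_block, f)

    def get(self, token_ids: tuple):
        path = self._path(token_ids)
        if not os.path.exists(path):
            return None          # miss: the caller recomputes the block
        with open(path, "rb") as f:
            return pickle.load(f)
```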
37.5 RDMA-based remote KV cache
The next tier: store KV cache on remote nodes, accessed over a fast network (InfiniBand RDMA).
The use case: share KV cache across replicas. In a multi-replica serving fleet, each replica has its own local prefix cache. If user A’s request hits replica 1 and warms up the cache for a system prompt, then user B’s request hits replica 2 with the same system prompt, replica 2 has to recompute the cache from scratch — even though replica 1 has it.
A shared remote KV cache lets all replicas read from a common pool. Replica 1’s compute warms up the cache for everyone; replica 2 fetches it via RDMA instead of recomputing. The aggregate hit rate goes up dramatically because the cache is shared across the entire fleet.
RDMA over InfiniBand (200 Gbps = 25 GB/s) is fast enough to be competitive with local NVMe and much faster than recomputing. The latency is a few microseconds plus the transfer time.
This is the architecture LMCache (next section) implements.
37.6 LMCache — the open-source implementation
LMCache (Liu et al., 2023, open-source) is the most prominent open-source KV cache offload implementation. It provides:
- A multi-tier KV cache with HBM → CPU DRAM → local NVMe → remote (Redis or other backends).
- A content-addressed storage model: cache blocks are keyed by their token sequence hash, so any replica can look them up.
- Integration with vLLM and SGLang as a backend store.
The architecture:
graph TD
A["vLLM / SGLang serving stack"] --> B["LMCache (tier-aware KV store)"]
B --> C["GPU HBM — in-process fast tier"]
B --> D["CPU DRAM — in-process medium tier"]
B --> E["NVMe SSD — file-based slow tier"]
B --> F["Redis — shared remote tier"]
F -.->|"cross-replica sharing"| G["Other replicas' LMCache"]
LMCache walks tiers top-to-bottom on lookup (HBM first) and writes new blocks to the fastest available tier, evicting down as each tier fills.
When the serving stack needs to look up a KV cache for a prefix, it queries LMCache. LMCache walks down the tiers in order — HBM first, then CPU, then NVMe, then Redis — until it finds the cache (or all tiers miss, in which case the serving stack recomputes).
When a new cache block is computed, it’s written to the fastest tier first (HBM). As HBM fills, blocks are evicted to CPU; as CPU fills, evicted to NVMe; as NVMe fills, evicted to Redis (or evicted entirely).
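The lookup walk and evict-down policy can be sketched in a few lines. This is a simplification under assumed names — real LMCache tracks per-block metadata and applies per-tier eviction policies — but the shape is the same:

```python
from collections import OrderedDict

class TieredKVStore:
    """Toy multi-tier store: lookup walks tiers fastest-first;
    writes land in the fastest tier, demoting LRU blocks downward."""

    def __init__(self, capacities):
        # capacities: ordered dict-like, fastest tier first,
        # e.g. {"hbm": 2, "cpu": 4, "nvme": 8}
        self.tiers = [(name, cap, OrderedDict())
                      for name, cap in capacities.items()]

    def get(self, key):
        for name, _, blocks in self.tiers:
            if key in blocks:
                blocks.move_to_end(key)      # refresh LRU position
                return name, blocks[key]
        return None, None                    # miss everywhere: recompute

    def put(self, key, block, level=0):
        if level >= len(self.tiers):
            return                           # fell off the last tier: gone
        _, cap, blocks = self.tiers[level]
        blocks[key] = block
        blocks.move_to_end(key)
        if len(blocks) > cap:                # tier full: demote LRU block
            victim, victim_block = blocks.popitem(last=False)
            self.put(victim, victim_block, level + 1)
```

A promote-on-hit policy (copying a block found in a slow tier back up to HBM) would be a natural addition, omitted here for brevity.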
The cross-replica sharing happens through the Redis tier. All replicas in the fleet share the same Redis backend, so a block written by one replica is visible to all others.
The benefit: the fleet-wide cache hit rate is much higher than per-replica. For a workload with shared prefixes (which is most production workloads), the hit rate jumps from ~30% per replica to ~80%+ fleet-wide. The recompute cost saved is large.
The cost:
- The Redis tier adds latency (a few ms for a remote fetch).
- The Redis backend itself has to be provisioned and managed.
- The cross-replica consistency semantics are eventually consistent — a block written but not yet propagated is simply recomputed by the other replicas.
LMCache is mature and used in real production deployments. The integration with vLLM is straightforward (a configuration flag) and the cross-replica benefit is well-documented.
37.7 Cross-replica KV sharing
The deep idea behind LMCache (and similar projects) is cross-replica KV cache sharing. Without it, the prefix cache is per-replica and the hit rate is bounded by how much traffic each individual replica sees. With it, the cache is fleet-wide and the hit rate scales with the entire fleet’s traffic.
The intuition is about reuse. With per-replica caches behind a load balancer, each of K replicas sees only ~1/K of the requests for a given prefix: every replica pays its own cold miss, and a prefix goes cold on each replica K times faster than it would fleet-wide. With a shared cache, only the very first request for a prefix misses — every later request can hit, regardless of which replica it lands on.
For a fleet of 10 replicas, cross-replica sharing can take the hit rate from ~30% to ~90% on a typical workload. The compute saved is enormous.
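A small simulation makes the gap concrete. The setup is an assumption for illustration: one prefix requested n times, routed uniformly at random across K replicas, with caches large enough that nothing is evicted:

```python
import random

def hit_rates(n_requests: int, n_replicas: int,
              trials: int = 2000, seed: int = 0):
    """Hit rate for one prefix under uniform random routing.
    Shared cache: only the first request misses.
    Per-replica caches: each replica misses the first time *it* sees
    the prefix, so the expected miss count is the expected number of
    distinct replicas touched."""
    rng = random.Random(seed)
    shared = (n_requests - 1) / n_requests
    per_replica_hits = 0
    for _ in range(trials):
        seen = set()
        for _ in range(n_requests):
            r = rng.randrange(n_replicas)
            if r in seen:
                per_replica_hits += 1
            seen.add(r)
    return per_replica_hits / (trials * n_requests), shared
```

With n_requests = n_replicas = 10 this lands near a ~35% per-replica hit rate vs 90% shared — the same order as the fleet numbers quoted above.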
The catch is bandwidth: every time a replica fetches a cache block from a remote tier, it consumes network bandwidth. For a fleet of 10 replicas each fetching from a shared Redis backend, the Redis backend’s bandwidth becomes the bottleneck. You need a high-bandwidth interconnect (RDMA or fast Ethernet) and a Redis backend that can handle the load.
In practice, cross-replica KV sharing only works if your network is fast enough. NVLink-based fabrics (within a node) are fast enough; modern datacenter Ethernet (100 Gbps+) is barely fast enough; older Ethernet (10 Gbps) is too slow.
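The bandwidth concern is easy to estimate on the back of an envelope. All the traffic figures below are illustrative assumptions, not measurements:

```python
def remote_tier_load_gbps(replicas: int, req_per_sec: float,
                          miss_rate: float, prefix_tokens: int,
                          kv_bytes_per_token: int = 320 * 1024) -> float:
    """Aggregate fetch bandwidth (Gbit/s) hitting the shared remote tier,
    assuming every local miss fetches its full prefix KV cache remotely."""
    bytes_per_sec = (replicas * req_per_sec * miss_rate
                     * prefix_tokens * kv_bytes_per_token)
    return bytes_per_sec * 8 / 1e9
```

With 10 replicas at 10 req/s each, a 50% local miss rate, and 1000-token prefixes, this already exceeds 130 Gbit/s — more than a single 100 Gbps link — which is why fast fabrics and per-block (rather than per-prefix) fetching matter.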
37.8 The KV-cache-as-a-service vision
A natural evolution of these ideas: treat the KV cache as a first-class service that’s separate from the serving stack. The serving stack becomes “stateless” with respect to KV cache — every prefix lookup goes to the KV cache service, every new computation is written back. The KV cache service handles tiering, eviction, sharing across replicas, and persistence.
This is the long-term direction the LLM serving community is heading. Several projects (LMCache, NIXL from NVIDIA, various proprietary systems) are converging on this model.
The advantages:
- Stateless serving replicas. They can be scaled, restarted, or migrated freely.
- Centralized cache management. Eviction, tiering, and policies are handled in one place.
- Shared across the entire fleet. Maximum hit rate.
- Persistence. Caches survive replica restarts.
The challenges:
- Network bandwidth. All cache traffic goes over the network.
- Latency. Even fast networks are slower than local HBM.
- Operational complexity. Another service to manage.
For most production serving today, the KV cache is per-replica with optional CPU/NVMe offload. The KV-cache-as-a-service model is emerging but not yet the default. As the techniques mature, expect it to become more common.
37.9 When offload helps and when it doesn’t
The decision tree:
Use CPU offload when:
- You have transient memory pressure that causes evictions.
- You have CPU memory headroom and aren’t bandwidth-constrained.
- The performance cost of swapping is less than the cost of dropping/recomputing.
Use NVMe offload when:
- You have very long prefixes (32k+ tokens) that are reused across requests.
- You have NVMe headroom and the workload has reasonable locality.
- You can tolerate ~100ms cache load times for hits.
Use RDMA / remote tiers when:
- You have a multi-replica fleet with shared prefixes.
- You have fast network (InfiniBand or 100Gbps+ Ethernet).
- The aggregate hit-rate improvement is worth the operational overhead.
Don’t use offload when:
- Your cache fits in HBM. No reason to add complexity.
- Your prefixes are short or unique. The fetch cost approaches recompute.
- Your network is too slow for the remote tier to be faster than recompute.
For most production deployments today, HBM + GQA + KV quantization is enough. Offload is for specific high-value cases — long context, multi-replica with shared prefixes, very memory-constrained environments.
37.10 The mental model
Eight points to take into Chapter 38:
- The KV cache lives in a memory hierarchy. HBM > CPU DRAM > NVMe > remote.
- The decision is recompute-vs-fetch. Each is cheap or expensive depending on the prefix length.
- CPU offload is a safety valve for transient memory pressure.
- NVMe offload is for long, reused prefixes. Worth it when prefix reuse is high and contexts are long.
- RDMA / remote tiers enable cross-replica sharing, which dramatically increases fleet-wide hit rate.
- LMCache is the canonical open-source multi-tier KV cache. Used in production with vLLM and SGLang.
- KV-cache-as-a-service is the long-term direction. Not yet the default but coming.
- Use offload only when justified. HBM + GQA + KV quantization is enough for most workloads.
In Chapter 38 we look at the kernel-level optimization frontier — CUDA, CUTLASS, Triton, TVM, and the question of whether you ever need to write your own.
Read it yourself
- Liu et al., LMCache: A Hybrid Cache for LLM Serving (2023).
- The vLLM documentation on CPU offload (swap_space configuration).
- The LMCache GitHub repository and integration guides.
- The NIXL (NVIDIA Inference Xfer Library) documentation for the cross-node case.
- The Pope et al. Efficiently Scaling Transformer Inference paper for the cost analysis framing.
Practice
- For a 5000-token prefix on Llama 3 70B, compute the recompute cost vs the fetch cost from CPU DRAM, NVMe, and RDMA. Which tier is the cheapest?
- Why is CPU offload mostly a safety valve rather than a performance optimization? Compare PCIe bandwidth to HBM bandwidth.
- Cross-replica KV sharing can increase hit rate from 30% per replica to 90% fleet-wide. Why does the math work out that way?
- Read the LMCache paper. Identify the eviction policy and explain why it’s a good fit for the multi-tier model.
- For a fleet of 20 replicas with shared prefixes, estimate the network bandwidth Redis would need to handle if every cache miss goes to Redis. Is this feasible on 100 Gbps Ethernet?
- Why is the KV-cache-as-a-service model attractive long-term? What are the obstacles?
- Stretch: Set up LMCache with vLLM on a multi-GPU machine. Configure CPU offload and measure the impact on a workload that exceeds GPU KV memory.
Concept check
- 1. When should you prefer fetching KV cache from CPU DRAM rather than recomputing it from scratch?
- 2. What is the primary motivation for RDMA-based remote KV cache sharing across replicas?
- 3. NVMe offload is most useful for which KV cache pattern?
- 4. LMCache keys cache blocks by a content-addressed token-sequence hash. What does this enable across replicas that a per-replica, pointer-based block index cannot?