KV cache sharing across replicas: LMCache and friends
"Per-replica prefix caches are good. Cross-replica prefix sharing is great. The math is brutal: hit rate scales with the entire fleet's traffic."
We covered prefix caching within a single vLLM instance in Chapter 29 and KV cache offload in Chapter 37. This chapter is about the production reality of sharing the KV cache across multiple inference replicas in a fleet. By the end you’ll know how it works, what it costs, when to use it, and what the major implementations are.
This is a relatively new area in 2024-25 — production-grade KV cache sharing across replicas was research three years ago and is now becoming standard practice. The main open-source implementation is LMCache, which we briefly covered in Chapter 37.
Outline:
- The per-replica vs cross-replica problem.
- Why cross-replica matters: the hit-rate math.
- The architecture: shared cache backends.
- LMCache in detail.
- Network bandwidth requirements.
- Consistency semantics.
- Operational considerations.
- The decision: when to use it.
53.1 The per-replica vs cross-replica problem
Recall from Chapter 29: prefix caching within a single vLLM replica works by storing the KV cache for prompt prefixes and reusing it across requests. When the same system prompt is used by 100 sequential requests on the same replica, the prefix is cached once and reused 100 times. The TTFT savings are huge.
But prefixes don’t always hit the same replica. In a multi-replica fleet:
- Replica 1 sees user A’s request with system prompt X. Caches X.
- Replica 2 sees user B’s request with system prompt X. Doesn’t have X in its cache. Recomputes from scratch.
- Replica 3 sees user C’s request with system prompt X. Doesn’t have X. Recomputes again.
- … and so on.
The per-replica hit rate scales with traffic_per_replica × prefix_reuse_rate. With 10 replicas and prefix-oblivious routing, each replica sees 1/10 of the traffic, so each one rediscovers every shared prefix on its own and the hit rate falls well short of what the workload allows.
For the fleet as a whole, most prefix recomputations are wasted. The work was already done by some other replica; you just don’t have access to it.
The fix is cross-replica KV cache sharing: have a shared cache that all replicas can read from and write to. When any replica computes a new prefix’s KV cache, the cache is stored centrally and any other replica can fetch it.
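The effect is easy to see in a toy simulation (hypothetical numbers: 10 replicas, 50 distinct shared prefixes, random routing). With per-replica caches, only repeats that land on the same replica hit; with a shared cache, every repeat after the first hits, regardless of which replica computed it:

```python
import random

random.seed(0)
NUM_REPLICAS = 10
NUM_REQUESTS = 1000
NUM_PREFIXES = 50  # distinct system prompts / templates in the workload

per_replica_seen = [set() for _ in range(NUM_REPLICAS)]
shared_seen = set()
per_replica_hits = shared_hits = 0

for _ in range(NUM_REQUESTS):
    prefix = random.randrange(NUM_PREFIXES)   # which prefix this request uses
    replica = random.randrange(NUM_REPLICAS)  # prefix-oblivious routing
    if prefix in per_replica_seen[replica]:
        per_replica_hits += 1
    per_replica_seen[replica].add(prefix)
    if prefix in shared_seen:
        shared_hits += 1
    shared_seen.add(prefix)

print(f"per-replica hit rate: {per_replica_hits / NUM_REQUESTS:.0%}")
print(f"shared-cache hit rate: {shared_hits / NUM_REQUESTS:.0%}")
```

The shared cache misses each prefix at most once fleet-wide; the per-replica caches miss each prefix up to once per replica.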
53.2 Why cross-replica matters: the hit-rate math
The hit-rate math is the killer argument for cross-replica sharing.
For a per-replica cache, the hit rate is roughly:
hit_rate_per_replica ≈ (per_replica_traffic × prefix_reuse_rate) / cache_size
Where prefix_reuse_rate is “what fraction of requests share a prefix with a recent request.” For a typical chat workload with a fixed system prompt, this is essentially 1.0 — every request shares the prefix.
The bottleneck is per_replica_traffic. If you have 10 replicas, each sees 10% of the traffic. The cache hit rate scales with that.
For a cross-replica cache, the hit rate is:
hit_rate_fleet ≈ (total_fleet_traffic × prefix_reuse_rate) / shared_cache_size
The hit rate scales with total fleet traffic, not per-replica traffic. For a fleet of 10 replicas with a shared cache, the effective hit rate can approach 10× the per-replica figure (capped, of course, at 100%).
The empirical numbers from production deployments:
- Per-replica cache: 30-50% hit rate (depending on workload).
- Cross-replica cache: 80-95% hit rate.
The difference is dramatic. For a workload where the prefix is most of the prompt cost (RAG, long system prompts), going from 30% hit rate to 90% hit rate cuts prefill compute by ~3-4×. The cost savings on a large fleet are enormous.
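The 3-4× figure follows from simple arithmetic (assumed numbers for illustration): if the shared prefix accounts for a fraction f of prefill FLOPs, the prefill compute per request relative to a no-cache baseline is (1 − f) + f × (1 − hit_rate):

```python
def relative_prefill_cost(hit_rate, prefix_fraction):
    # Fraction of baseline prefill compute still paid per request:
    # the non-prefix part is always computed; the prefix part only on misses.
    return (1 - prefix_fraction) + prefix_fraction * (1 - hit_rate)

for f in (0.8, 0.9):
    before = relative_prefill_cost(0.30, f)  # per-replica cache
    after = relative_prefill_cost(0.90, f)   # cross-replica cache
    print(f"prefix={f:.0%}: {before:.2f} -> {after:.2f} "
          f"({before / after:.1f}x less prefill compute)")
# prefix=80%: 0.76 -> 0.28 (2.7x less prefill compute)
# prefix=90%: 0.73 -> 0.19 (3.8x less prefill compute)
```

So for workloads where the prefix is 80-90% of prefill cost, going from a 30% to a 90% hit rate yields roughly the 3-4× savings quoted above.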
53.3 The architecture: shared cache backends
The basic architecture for cross-replica KV cache sharing:
Client requests
|
v
[AI Gateway]
|
+---> Replica 1 (vLLM)
+---> Replica 2 (vLLM)
+---> Replica 3 (vLLM)
...
|
v
[Shared KV Cache Backend]
(Redis, RDMA, etc.)
Each vLLM replica:
- Receives a request.
- Looks up the prompt prefix in the shared cache.
- If hit: fetches the KV blocks from the shared cache, populates its local KV cache with them, and starts decode immediately.
- If miss: prefills locally, writes the new blocks to the shared cache, then continues with decode.
The shared cache is a separate component:
- Storage: Redis (network-accessible, fast), or RDMA-based (very fast), or distributed file system (slower but persistent).
- Indexing: a content-addressed map from token sequence hashes to KV blocks.
- Tiering: blocks may be in HBM (on the active replica), CPU DRAM (on the active replica), or in the shared backend (Redis).
The shared cache is content-addressed: the key is a hash of the prefix’s tokens, so any replica that computes the same prefix derives the same key. Replicas agree on keys with no coordination at all.
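A sketch of content-addressed keys, in the spirit of vLLM's block hashing (details simplified and illustrative): each fixed-size block of tokens is keyed by the hash of (previous block's key, this block's tokens), so a key identifies the entire prefix up to that block, and any replica hashing the same tokens derives the same keys.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

def block_keys(token_ids):
    keys, parent = [], b""
    # Only full blocks are keyed; a partial trailing block is not cacheable.
    for i in range(0, len(token_ids) // BLOCK_SIZE * BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        # Chain the parent key so each key covers the whole prefix so far.
        h = hashlib.sha256(parent + str(block).encode()).hexdigest()
        keys.append(h)
        parent = h.encode()
    return keys

# Two replicas tokenizing the same system prompt derive identical keys:
assert block_keys(list(range(40))) == block_keys(list(range(40)))
```

The chaining matters: two prompts that share only their first block still share that block's key, so partial-prefix hits fall out naturally.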
53.4 LMCache in detail
LMCache is the leading open-source implementation of cross-replica KV cache sharing. It integrates with vLLM and SGLang.
The architecture:
- A client library runs in each inference replica. It intercepts KV cache lookups and writes.
- A server (Redis or similar) stores the shared cache.
- A tier manager handles which blocks live in HBM, CPU, NVMe, or the remote backend.
The flow:
- A request arrives at replica 1.
- The vLLM scheduler asks LMCache “is the prefix for this prompt cached?”
- LMCache looks in HBM (fastest), then CPU DRAM, then NVMe, then Redis (slowest).
- On a hit: LMCache returns the blocks. They’re loaded into HBM if not already there.
- On a miss: LMCache returns nothing. vLLM does the prefill normally.
- After prefill, vLLM hands the new blocks to LMCache, which writes them through to the slower tiers (CPU, NVMe, Redis) as configured.
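The lookup-then-promote flow above can be sketched as follows (a toy sketch, not LMCache's actual API; tier names follow the list above):

```python
class TieredKVCache:
    def __init__(self):
        # Ordered fastest -> slowest: HBM, CPU DRAM, NVMe, remote (Redis).
        # Python dicts preserve insertion order, so iteration is fastest-first.
        self.tiers = {"hbm": {}, "cpu": {}, "nvme": {}, "remote": {}}

    def get(self, key):
        for name, tier in self.tiers.items():
            if key in tier:
                blocks = tier[key]
                self._promote(key, blocks, up_to=name)  # pull into faster tiers
                return blocks
        return None  # miss: caller prefills, then calls put()

    def put(self, key, blocks):
        for tier in self.tiers.values():  # write-through to all tiers
            tier[key] = blocks

    def _promote(self, key, blocks, up_to):
        for name, tier in self.tiers.items():
            if name == up_to:
                break
            tier[key] = blocks

cache = TieredKVCache()
cache.tiers["remote"]["prefix-1"] = "kv-blocks"  # written by another replica
assert cache.get("prefix-1") == "kv-blocks"      # remote hit...
assert "prefix-1" in cache.tiers["hbm"]          # ...promoted into HBM
```

A real implementation additionally enforces per-tier capacity limits and eviction (LRU by default in LMCache), which this sketch omits.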
The key configuration parameters:
- Tier sizes: how much HBM, CPU, NVMe, and Redis storage to use.
- Eviction policy: LRU is the default.
- Replication: how many copies of each block to keep in Redis (for fault tolerance).
LMCache integrates with vLLM via a configuration flag:
vllm serve <model> \
--kv-transfer-config '{"kv_connector": "LMCacheConnector"}'
The connector handles the integration. LMCache itself runs as a separate process or service.
53.5 Network bandwidth requirements
Cross-replica cache sharing has a real cost: network bandwidth. Every cache miss that gets fetched from the remote backend pulls data over the network.
For a 1000-token prefix on Llama 3 70B, the KV cache is ~320 MB. Every cache miss that fetches this from Redis pulls 320 MB over the network. A fleet of 10 replicas, each serving 100 requests/second, where 50% of requests fetch the prefix remotely from Redis (say ~40% of requests hit in local HBM and the remaining ~10% are full misses) generates:
10 replicas × 100 req/s × 0.5 miss_rate × 320 MB = 160 GB/s of network traffic
That’s a lot. At 12.5 GB/s per 100 Gbps Ethernet link, that aggregate needs more than a dozen links. Modern data center fabrics can carry it, but only with deliberate network design.
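The estimate above as a reusable back-of-envelope helper. The per-token KV size for Llama 3 70B in FP16 works out to 2 (K and V) × 80 layers × 8 KV heads × 128 head_dim × 2 bytes = 320 KB/token:

```python
def remote_fetch_bandwidth_gb_s(replicas, req_per_replica_s, remote_fetch_rate,
                                prefix_tokens, kv_bytes_per_token):
    # Aggregate bytes/second pulled from the remote backend by the fleet.
    bytes_per_s = (replicas * req_per_replica_s * remote_fetch_rate
                   * prefix_tokens * kv_bytes_per_token)
    return bytes_per_s / 1e9

KV_BYTES_FP16 = 2 * 80 * 8 * 128 * 2  # 327,680 bytes/token

print(remote_fetch_bandwidth_gb_s(10, 100, 0.5, 1000, KV_BYTES_FP16))
# prints 163.84 -- the ~160 GB/s figure above

# INT8 quantization halves the per-token footprint, and with it the bandwidth:
print(remote_fetch_bandwidth_gb_s(10, 100, 0.5, 1000, KV_BYTES_FP16 // 2))
# prints 81.92
```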
The mitigations:
- InfiniBand or RDMA Ethernet for the shared cache backend. 200-400 Gbps fabric.
- Compression of KV blocks before sending. INT8 quantization can halve the bandwidth.
- Smart routing at the gateway: prefer to send a request to a replica that already has the prefix cached locally. This reduces remote fetches.
- Hierarchical caching: a per-zone or per-rack cache reduces cross-cluster traffic.
In practice, cross-replica cache sharing is most valuable when:
- The fleet is large (many replicas).
- Prefixes are reused heavily across the fleet.
- The network can support the bandwidth.
For small fleets (<10 replicas) or workloads with little cross-fleet sharing, per-replica caching is enough.
53.6 Consistency semantics
The shared cache is typically eventually consistent: a block written by one replica may not be immediately visible to the others. If replica 1 writes block X at time T, replica 2 reading at time T+epsilon might still get a miss, because the write hasn’t propagated through Redis yet. Because blocks are content-addressed and immutable, this is safe: a stale read is only ever a miss, never wrong data.
The consequence: occasional double computation. Two replicas may simultaneously compute the same prefix because neither sees the other’s pending write. The wasted work is small (one extra prefill per occasional race), and the alternative (strong consistency) would be much slower.
For most workloads, eventual consistency is fine. The hit rate degradation from races is small (under 5%) and the simplicity is worth it.
If you need strong consistency (e.g., for some specific compliance or correctness reason), you can use a lock-based protocol — replicas acquire a lock before computing a prefix, and other replicas wait. This is much slower but guarantees no double-computation.
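Why the race is harmless can be shown in a few lines (simplified sketch): KV blocks are immutable and content-addressed, so two racing replicas that write the "same" key write identical bytes, and a last-writer-wins store converges to the same state either way.

```python
shared = {}

def write_block(key, blocks):
    # Duplicate writes from a race are idempotent: same key implies
    # same content, so overwriting can never corrupt the cache.
    assert shared.get(key, blocks) == blocks
    shared[key] = blocks

# Replica 1 and replica 2 both miss, both prefill, and both write:
write_block("hash(prefix-X)", "kv-bytes-for-X")
write_block("hash(prefix-X)", "kv-bytes-for-X")  # wasted prefill, but a safe write
assert shared == {"hash(prefix-X)": "kv-bytes-for-X"}
```

The only cost of the race is the duplicated prefill, which is exactly the "occasional double computation" described above.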
53.7 Operational considerations
A few production-relevant points:
(1) The Redis backend is critical infrastructure. If Redis goes down, cache hits all become misses, and the fleet’s effective compute requirement spikes. Have HA Redis (Redis Cluster, Redis Sentinel) and monitor it carefully.
(2) Memory pressure on Redis. The shared cache can be terabytes. Size Redis appropriately, with enough headroom for growth and eviction policies that handle pressure gracefully.
(3) Cache warming. When a new replica joins the fleet, it has an empty local cache. The first few requests on it pay the remote-fetch cost. If you’re scaling up frequently, this adds latency variability.
(4) Per-tenant isolation. Multi-tenant workloads need to ensure tenant A’s cached prefixes don’t leak to tenant B. The cache key should include a tenant ID, or the cache should be per-tenant.
(5) Versioning. Different model versions have different KV cache shapes. Caches from one version are useless for another. The cache key should include the model version.
(6) Monitoring. Track:
- Cross-replica hit rate (the key metric).
- Redis memory usage and eviction rate.
- Network bandwidth between replicas and cache backend.
- Latency of cache lookups.
(7) Alternative: prefix-aware routing. Instead of (or in addition to) cross-replica caching, the gateway can route requests with the same prefix to the same replica. This keeps the cache local and avoids network traffic. SGLang has built-in support for this.
(8) Fallback mode. If the shared cache is unavailable, replicas should fall back to per-replica caching automatically. Don’t let cache failures bring down the fleet.
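Point (7) above, prefix-aware routing, can be sketched at the gateway as follows (hypothetical helper, assuming the gateway can see the prompt): hash the prompt's leading span and pin that hash to a replica, so repeats of a prefix land where its KV cache already lives.

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]

def route(prompt, prefix_chars=512):
    # Hash only the leading span: requests sharing a system prompt or
    # template map to the same replica even if their suffixes differ.
    h = hashlib.sha256(prompt[:prefix_chars].encode()).digest()
    return REPLICAS[int.from_bytes(h[:8], "big") % len(REPLICAS)]

system = "You are a helpful assistant. " * 30  # ~900 chars of shared prefix
assert route(system + "Question A") == route(system + "Question B")
```

A production gateway would use consistent hashing instead of a plain modulo, so that adding or removing a replica doesn't reshuffle every prefix, and would fall back to load-based routing when the pinned replica is overloaded.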
53.8 The decision: when to use cross-replica cache sharing
The benefits are biggest when:
- Large fleet (10+ replicas).
- Heavy prefix reuse (chat with system prompts, RAG, multi-tenant with shared templates).
- Long prefixes (1000+ tokens — the absolute prefill savings are larger).
- Fast network (100 Gbps+ between replicas and cache backend).
The benefits are smaller when:
- Small fleet (<5 replicas).
- Unique prompts (search engine, summarization).
- Short prefixes.
- Slow network.
For most large production deployments, the math favors cross-replica sharing. The hit-rate improvement is dramatic, the operational cost is real but manageable, and the cost savings are large.
For small or specialized deployments, per-replica caching is enough.
The decision is increasingly easy in 2025 as LMCache and similar tools mature. Expect cross-replica KV cache sharing to become a standard feature of production LLM serving over the next year or two.
53.9 The mental model
Eight points to take into Chapter 54:
- Per-replica caches scale hit rate with per-replica traffic. Cross-replica caches scale hit rate with total fleet traffic.
- The hit rate math can take you from 30% to 90% — a 3× reduction in prefill compute.
- LMCache is the leading open-source implementation. Integrates with vLLM and SGLang.
- The shared cache is content-addressed. Same prefix → same key → cache hit across replicas.
- Network bandwidth is the cost. Need 100+ Gbps for large fleets.
- Eventual consistency is fine for most workloads. Occasional double-computation is acceptable.
- Prefix-aware routing is a complementary approach: route similar prefixes to the same replica.
- For large fleets with shared prefixes, cross-replica sharing is the right choice. For small or unique-prompt workloads, per-replica is enough.
In Chapter 54 we look at the operational side of model startup: warmup, readiness probes, and the model-isn’t-ready-yet problem.
Read it yourself
- The LMCache GitHub repository and documentation.
- The vLLM KVConnector interface documentation.
- The SGLang documentation on prefix-aware routing.
- The NIXL (NVIDIA Inference Xfer Library) documentation.
- Production case studies on cross-replica KV cache sharing (search for “LMCache production”).
Practice
- Compute the cross-replica hit rate improvement for a fleet of 20 replicas serving 1000 req/s where each request shares an 80% prefix with recent requests.
- Estimate the network bandwidth required for a fleet of 10 replicas serving 50 req/s with 30% remote cache miss rate and 200 MB average prefix.
- Why is cross-replica cache sharing eventually consistent? What’s the alternative and why is it slower?
- Read the LMCache integration guide for vLLM. Identify the configuration steps.
- Why does prefix-aware routing complement cross-replica sharing? When would you use one vs the other?
- For a fleet of 3 replicas serving 5 req/s, is cross-replica sharing worth the operational overhead? Justify.
- Stretch: Set up LMCache with vLLM on a local cluster. Verify cache hits across replicas with a workload that has shared prefixes.