Caching patterns: Redis, cache-aside, write-through, write-behind
"There are only two hard things in computer science: cache invalidation, and naming things. Cache invalidation is three of them"
Every serving system accumulates caches. There is the CDN at the edge, the HTTP cache in front of the API, the Redis instance shared by all the backends, the in-process LRU in every pod, the page cache in Linux, the chip caches on the CPU, and, in an LLM serving system specifically, the KV cache (Chapter 22), the prefix cache (Chapter 29), the embedding cache, and the tokenizer cache. The reason caches accumulate is that the cost of producing an answer is almost always higher than the cost of remembering one, and the reason that’s load-bearing is that the former cost grows with scale while the latter doesn’t. Any system that doesn’t cache eventually falls over.
But caching is not free. A cache introduces a new source of truth that can disagree with the database, and when they disagree, you get bugs that are subtle, expensive to diagnose, and hard to reproduce. The discipline of caching is the discipline of picking a pattern that makes the disagreement either impossible or acceptable. This chapter goes through the canonical caching patterns from Designing Data-Intensive Applications and applies them to ML-system hot paths: model metadata, feature lookups, routing decisions, rate-limit buckets, embedding results. The focus is on the pattern choice (cache-aside vs write-through vs write-behind vs refresh-ahead), the TTL story, and the stampede protection techniques that separate a cache that works from a cache that collapses under its own success.
Outline:
- Why caches exist and what they cost.
- Redis as the default network cache.
- Cache-aside (lazy loading).
- Write-through and write-behind.
- Refresh-ahead and predictive caching.
- TTL choice and invalidation strategies.
- Thundering herds and stampede protection.
- Singleflight and request coalescing.
- Cache patterns in ML serving systems.
- The mental model.
89.1 Why caches exist and what they cost
A cache is a fast, usually volatile store that answers queries faster or cheaper than the canonical store. Every cache makes a tradeoff:
- Latency vs freshness. A cached answer is faster than a fresh one. It is also older.
- Cost vs correctness. A cache reduces load on the expensive store. It also risks serving stale or incorrect data.
- Simplicity vs complexity. A cache moves your mental model from “read from the database” to “read from the cache, which may or may not have the thing, which may or may not be fresh, which needs to be filled from the database if absent.”
The economic argument for caching is the one that wins in production: a cache hit is 10-1000× cheaper than a miss, and most real workloads have skewed access patterns where a small set of keys accounts for most of the traffic (Zipf distribution). Cache the hot keys, take the load off the backend, and save real money.
The correctness argument is subtler. A cache is a source of truth that lies — it returns a value that may not match the canonical store. The question is whether that lie is acceptable for the use case. For a dashboard showing total users, a 30-second-old value is fine. For a user’s account balance on the payments screen, it’s not. For a feature store returning a user’s last-seen article for ranking, it’s fine until it’s stale enough that the ranking looks wrong. The engineer’s job is to map each read path to an acceptable staleness budget and pick a cache pattern that lives within it.
The cost of caching, the part that’s easy to miss: a wrong cache is worse than no cache. Wrong cache means stale, incorrect, or inconsistent. Stale cache serves bugs. Inconsistent cache (different replicas returning different values) serves flakes. Every cache you add is a new place for production bugs to hide.
89.2 Redis as the default network cache
Redis is the default network cache for almost every web and ML system. It is a single-threaded, in-memory data store that supports strings, hashes, lists, sets, sorted sets, bitmaps, HyperLogLogs, geo indexes, streams, and more. It runs at around 100k-1M operations per second on a single node, with sub-millisecond latencies when the network is good. It supports replication (async, master-replica) and partitioning (via Redis Cluster, which shards keys across nodes). It is not quite as fast as Memcached for pure GET/SET, but it is vastly more useful because of the data structures.
For caching, the operations that matter:
- GET, SET, DEL — the basic key-value API.
- EXPIRE, TTL, PERSIST — TTL management.
- INCR, DECR — atomic counters, used for rate limits.
- HGET, HSET, HDEL — hash fields, used to store small structured objects without serialization overhead.
- ZADD, ZRANGE — sorted sets, used for leaderboards, rate limits with sliding windows, and priority queues.
- SETNX (or SET NX) — set only if not exists. The primitive for distributed locks and singleflight.
- EVAL — run a Lua script atomically. Useful for complex atomic operations.
Redis’s single-threaded execution model is worth a pause. All commands on a Redis node run serially, so there is no per-command locking. This makes multi-step atomic operations trivial via EVAL or MULTI/EXEC transactions. The cost is that a single slow command blocks everything (don’t run KEYS * on a production Redis), and a single Redis node tops out at one core’s worth of command throughput, so scaling beyond ~1M ops/sec requires sharding.
Redis Cluster shards the keyspace across nodes using a 16,384-slot hash partitioning scheme. Each key hashes to a slot; slots are assigned to nodes. Clients compute the slot locally and route to the right node. The catch is that multi-key operations (MSET, MGET, transactions) only work within a single slot. To force two keys to the same slot, use hash tags: {user:123}:name and {user:123}:email both hash by the user:123 part and end up on the same slot.
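The slot mapping is simple enough to sketch. A minimal version of the scheme described in the Redis Cluster specification — CRC16 (the XMODEM variant) of the key, modulo 16,384, hashing only the hash-tag content when one is present:

```python
def crc16(data: bytes) -> int:
    # CRC16-CCITT (XMODEM), the variant the Redis Cluster spec uses:
    # polynomial 0x1021, initial value 0. Check value: crc16(b"123456789") == 0x31C3.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    # If the key contains a non-empty {hash tag}, only the tag content is
    # hashed — which is how {user:123}:name and {user:123}:email land on
    # the same slot.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384
```

Cluster-aware clients do exactly this locally, then route the command to whichever node currently owns the slot.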
Redis Sentinel is the older high-availability solution: a master with async replicas, and Sentinel watches for master failure and promotes a replica. Sentinel remains supported, but most new deployments reach for Redis Cluster instead; you’ll still see Sentinel in the wild.
For ML platforms, the typical pattern is a Redis Cluster with 3-6 shards, each with 1-2 replicas, sitting in front of the hot-path services. Total capacity is a few hundred GB of RAM, throughput in the hundreds of thousands of ops/sec, p99 latency under 2 ms. That is the network cache tier.
89.3 Cache-aside (lazy loading)
Cache-aside is the simplest and most common pattern. The application code explicitly manages the cache:
def get_user(user_id):
    cached = redis.get(f"user:{user_id}")
    if cached is not None:
        return json.loads(cached)
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    if user is not None:
        redis.setex(f"user:{user_id}", 300, json.dumps(user))  # TTL 5 minutes
    return user
The flow:
- Read the cache. If the key is present, return it (hit).
- Read the database. If the cache was absent, fetch from the canonical store (miss).
- Populate the cache. Store the value with a TTL so it expires naturally.
- Return the value.
Writes are straightforward:
def update_user(user_id, updates):
    db.execute("UPDATE users SET ... WHERE id = %s", ..., user_id)
    redis.delete(f"user:{user_id}")  # invalidate the cache
On write, invalidate the cache rather than update it. The next read will miss and reload. This is simpler than trying to update both in lockstep and avoids the common race where an update-in-place overwrites a newer value with a stale one. (A narrower race remains — a concurrent reader can repopulate the cache from a value read just before the write — but the TTL bounds how long it lasts.)
The properties of cache-aside:
- Simple. The application controls everything. No magic.
- Lazy. Data only enters the cache when it’s requested. The first request after a write is slow.
- Tolerant of cache failures. If Redis is down, the application still works (slower). Reads fall through to the database.
- Prone to thundering herd on misses. If a key is popular and expires, many simultaneous readers will all miss and hit the database. Stampede protection is critical.
- Stale windows are bounded by TTL. A value can be stale for up to its TTL. Pick the TTL based on your staleness budget.
Cache-aside is the default for most use cases. It is what every senior engineer should reach for first.
89.4 Write-through and write-behind
Write-through flips the responsibility: the cache is the write path, and the database is written atomically as part of the same operation.
def update_user(user_id, updates):
    user = apply(updates, get_user(user_id))
    cache.set(f"user:{user_id}", user)  # cache library writes to DB and cache atomically
    return user
The cache library (or a wrapper in the application) writes to the database and the cache in one operation. From the application’s view, the cache is the database. Reads always come from the cache, and the cache is always up to date because every write goes through it.
Write-through is appealing because it eliminates the stale window — the cache and the database are always consistent (within the atomicity guarantees of the underlying pair). But it has real costs:
- Every write pays the database latency. No fire-and-forget writes. If the database is slow, the cache is slow.
- The cache is a hard dependency for writes. If Redis is down, writes fail.
- You’re writing through to the cache for every update, even data that’s never read. Wasted effort.
- The cache library has to coordinate atomicity across two systems. Tricky to get right.
Write-through is the right pattern when reads vastly outnumber writes and you need tight consistency. Model registries, feature flag stores, routing tables — anything where the write rate is low and readers want the freshest value immediately.
Write-behind (write-back) is write-through’s lazy cousin. Writes go to the cache, which returns immediately, and the cache asynchronously propagates them to the database in the background:
def update_user(user_id, updates):
    cache.set(f"user:{user_id}", ...)
    queue.push(("write_user", user_id, ...))  # worker picks this up later
Write-behind is much faster for the writer — the database latency is out of the critical path. But:
- Risk of data loss. If the cache fails between the write and the flush, the write is gone.
- Ordering is hard. If writes come in quickly, the flusher has to preserve order or merge them.
- Read-after-write within a single process is fine (the cache has the new value) but across processes it’s eventually consistent.
- Failure recovery is complex. What happens if the flush fails after several retries?
Write-behind is the right pattern for high-write workloads where durability per-write is not critical: metrics, counters, rate-limit tokens, session state. It’s also the pattern you implicitly use when batching writes from a stream into a database.
For ML-system hot paths, write-behind is common for rate-limit bucket state (updated on every request, flushed periodically to a persistent store) and for usage counters (token counts per user, flushed every 10 seconds to billing).
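A minimal in-memory sketch of that write-behind counter shape. The dict stands in for Redis INCR state and flush_to_billing is a hypothetical sink; a real version would flush on a timer and handle flush failures:

```python
import threading
from collections import defaultdict

class WriteBehindCounters:
    def __init__(self, flush_fn):
        self._deltas = defaultdict(int)   # stands in for Redis INCR state
        self._lock = threading.Lock()
        self._flush_fn = flush_fn         # hypothetical sink, e.g. a billing table

    def incr(self, key, n=1):
        with self._lock:                  # fast path: no database latency
            self._deltas[key] += n

    def flush(self):
        # Swap the buffer under the lock so incr() never blocks on I/O.
        # If flush_fn raises, the deltas in `batch` are lost — the
        # write-behind durability risk, made concrete.
        with self._lock:
            batch, self._deltas = dict(self._deltas), defaultdict(int)
        if batch:
            self._flush_fn(batch)

billing = defaultdict(int)

def flush_to_billing(batch):
    for user, tokens in batch.items():
        billing[user] += tokens

counters = WriteBehindCounters(flush_to_billing)
counters.incr("user:1", 120)
counters.incr("user:1", 80)
counters.flush()   # billing["user:1"] is now 200
```

The swap-then-flush structure is the important part: the writer never waits on the durable store, and the window of potential loss is exactly one flush interval.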
89.5 Refresh-ahead and predictive caching
Refresh-ahead is a variant where the cache proactively refreshes entries before they expire. When a read arrives and the cached value is close to expiry (say, 80% of the TTL elapsed), the cache kicks off an asynchronous refresh. The current read still returns the slightly stale cached value, but the next read will get the freshly refreshed one.
The benefit is that popular keys never miss. Under normal cache-aside, a popular key expires every TTL and the first request after expiry pays the miss cost. Under refresh-ahead, the refresh happens in the background and no user-facing read pays the cost. The staleness window is slightly larger (up to 1.2× the TTL in the worst case) but the tail latency improves dramatically.
The cost is that you can refresh entries that never get accessed again, wasting work. This is why refresh-ahead is usually applied selectively to the top-N keys by access frequency, not to the entire cache.
Redis doesn’t implement refresh-ahead natively. You build it in your application or a cache library. A typical implementation uses a two-TTL scheme: a “soft TTL” that triggers refresh and a “hard TTL” that actually expires the entry.
def get_with_refresh(key, ttl_soft=60, ttl_hard=120):
    entry = redis.get(key)
    if entry is None:
        return load_and_cache(key, ttl_hard)
    value, stored_at = unpack(entry)
    age = time.time() - stored_at
    if age > ttl_soft:
        # kick off async refresh, but return current value
        async_queue.push(("refresh", key, ttl_hard))
    return value
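The helpers assumed above can take many shapes; one plausible one, with pack/unpack as a JSON envelope recording the store time (names and layout are illustrative, not a fixed API):

```python
import json
import time

def pack(value):
    # Envelope the value with the time it was stored, so readers can
    # compute its age against the soft TTL.
    return json.dumps({"value": value, "stored_at": time.time()})

def unpack(entry):
    obj = json.loads(entry)
    return obj["value"], obj["stored_at"]

def load_and_cache(key, ttl_hard):
    # Fetch from the canonical store and cache under the hard TTL.
    # `redis` and `load_from_backend` are the same ambient handles the
    # surrounding snippets assume.
    value = load_from_backend(key)
    redis.setex(key, ttl_hard, pack(value))
    return value
```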
Predictive caching is a further step: pre-populating the cache based on predicted future access. For ML inference, you might pre-populate the embedding cache with recent queries that are likely to repeat. For a user dashboard, you might pre-fetch the next page before the user scrolls. Predictive caching is workload-specific and worth the complexity only for the heaviest-traffic use cases.
89.6 TTL choice and invalidation strategies
Picking the right TTL is the single most impactful cache decision. Too short and you miss too often; too long and you serve stale data. The framework for picking TTL:
Step 1: Define the staleness budget. For each type of data, how old can the cached value be before the application misbehaves? For model metadata (model name, version, quantization), hours to days is fine. For feature flags, seconds to minutes. For account balances, probably none.
Step 2: Factor in the invalidation path. If you can invalidate the cache on every write, the TTL only matters as a fallback. Long TTL is safe. If you can’t reliably invalidate (writes happen in another service, or the canonical store is external), TTL is your only freshness mechanism.
Step 3: Pick the TTL based on tail-case damage. The question is not “how often will the cache be stale,” it’s “if the cache is stale for X seconds, how bad is the worst thing that happens?” A billing TTL that’s 30 seconds too long overcharges a user by 30 seconds of usage. A permission TTL that’s 30 seconds too long grants deleted access for 30 seconds. The latter is a security incident.
Step 4: Add jitter. Every cache entry should have a small random component on its TTL (say ±10%) to prevent synchronous expiry of many keys at once. Without jitter, a batch of cache fills at noon all expire at noon + TTL, causing a miss stampede.
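Jitter is a one-liner; a sketch:

```python
import random

def jittered(ttl, spread=0.10):
    # Return ttl with a ±spread random component so entries filled at the
    # same moment don't all expire at the same moment.
    return ttl * random.uniform(1 - spread, 1 + spread)

# e.g. redis.setex(key, int(jittered(300)), payload)
```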
Invalidation strategies:
- TTL-only. Let entries expire naturally. Simple, and the TTL is the maximum staleness. Good when writes are rare or the backend is authoritative.
- Explicit delete on write. Delete the cache key whenever the backend is updated. Tightest consistency, but requires the writer to know about the cache. If writes come from many places, easy to miss one.
- Versioned keys. Include a version in the cache key (user:123:v7). On write, bump the version. Old entries age out via TTL; readers always use the current version from some version store. Avoids invalidation races but requires a version source of truth.
- Pub/sub invalidation. On write, publish to a Redis pub/sub channel. All replicas listening to the channel invalidate their local caches. Useful for near-cache patterns where each application node has its own in-process cache on top of Redis.
- CDC-driven. Listen to the database’s change stream (Debezium, logical replication, DynamoDB Streams) and invalidate based on the changes. Most robust but adds a new moving part.
The right combination depends on the data’s write pattern and the team’s appetite for complexity. For ML systems, TTL-only with a short TTL is the right default for read-heavy metadata, and explicit-delete-on-write with a longer TTL is the right default for things that are updated in a single service.
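A toy sketch of the versioned-key scheme, with plain dicts standing in for Redis and the version store (the user:{id}:v{n} key layout is illustrative):

```python
versions = {}   # version source of truth (a Redis hash or the DB in practice)
cache = {}      # stands in for Redis; entries would carry a TTL

def current_key(entity_id):
    return f"user:{entity_id}:v{versions.get(entity_id, 1)}"

def read(entity_id, load):
    # Cache-aside fill under whatever version is current right now.
    key = current_key(entity_id)
    if key not in cache:
        cache[key] = load(entity_id)
    return cache[key]

def write(entity_id, save):
    save(entity_id)
    # Bump the version: old cache entries are instantly unreachable and
    # will age out via TTL — no delete race to get wrong.
    versions[entity_id] = versions.get(entity_id, 1) + 1
```

The price is the extra lookup against the version store on every read, which is why the version store itself is usually a tiny, hot, heavily cached structure.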
89.7 Thundering herds and stampede protection
The thundering herd problem: a popular cache entry expires, many concurrent readers all see the miss, all hit the backend simultaneously, and the backend tips over. This is the failure mode that turns a caching win into a production outage.
Example: a cache entry for a 70B model’s metadata has a TTL of 60 seconds and is read 10,000 times per second. At some point, the entry expires. Within a few milliseconds of expiry, the next several hundred requests all miss. They all query the model registry database, which is sized for the normal miss rate (~100 qps), not 10,000 simultaneous reads. The database slows down, the reads back up, the application pool fills with waiting requests, and the entire inference layer goes yellow.
Stampede protection is the set of techniques that prevent this. The main ones:
Lock-based (singleflight). When a miss occurs, acquire a lock on the key. Only the lock holder fetches from the backend. Other requests see that a lock is held and either wait or return a stale value. This is the Go singleflight package pattern, generalized. Implementation via Redis SETNX:
def get_with_lock(key):
    while True:
        val = redis.get(key)
        if val is not None:
            return val
        lock_key = f"lock:{key}"
        # nx=True: only one caller wins; ex=10 frees the lock if the holder crashes
        if redis.set(lock_key, "1", nx=True, ex=10):
            try:
                val = load_from_backend(key)
                redis.setex(key, 60, val)
                return val
            finally:
                redis.delete(lock_key)
        time.sleep(0.05)  # someone else is fetching; wait briefly and re-check the cache
The first miss gets the lock and fetches. Others wait briefly and retry, at which point the cache is usually populated.
Early refresh with probability. When a read finds a cached value that’s close to expiry, refresh it probabilistically. Xfetch’s formula is:
if now - delta * beta * log(random()) >= expiry: refresh

where delta is the observed fetch latency and beta is a tuning parameter (1.0 is a reasonable default). Since log(random()) is negative for values in (0, 1), the subtracted term pushes the effective time forward; as the entry approaches expiry, the probability of refresh rises smoothly, so different readers are likely to refresh at different times rather than all at once.
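As a runnable sketch of the decision function (the formula is the one from the Vattani et al. paper; note the log term is subtracted, and the guard against log(0) is a practical addition):

```python
import math
import random

def should_refresh(now, expiry, delta, beta=1.0):
    # XFetch: refresh early with a probability that rises smoothly as
    # expiry nears. delta is the observed recompute latency; log(r) is
    # negative for r in (0, 1), so -delta*beta*log(r) shifts the
    # effective time forward by a random amount.
    r = random.random() or 1e-12   # guard against log(0)
    return now - delta * beta * math.log(r) >= expiry
```

Past expiry the check always fires; far from expiry, or with a cheap recompute (small delta), it almost never does.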
Stale-while-revalidate. Serve the stale value while a background task fetches the new one. The user sees a slightly old value but no one waits. This is the same as refresh-ahead from §89.5 with the “serve stale” behavior explicit.
Request coalescing. When multiple requests for the same key arrive while a fetch is in flight, wait for the fetch to complete and use its result instead of issuing another fetch. This is in-process singleflight, not Redis-wide. For a single service with multiple concurrent handlers, this is trivial to implement with a map of in-flight futures.
Jittered TTLs. Make the expiry time of similar entries not exactly the same, so they don’t all miss at once.
Combining jittered TTL + in-process coalescing + Redis-based locks covers most of the thundering-herd risk. For the highest-traffic keys (system prompts, model metadata), add stale-while-revalidate.
89.8 Singleflight and request coalescing
A closer look at singleflight, because it’s the one pattern that every senior engineer should know.
The observation: if a thousand concurrent handlers all want the same result, you don’t need a thousand backend calls. You need one backend call, and the rest should wait for it and share the result. The generic pattern:
class SingleFlight:
    def __init__(self):
        self._inflight = {}
        self._lock = threading.Lock()

    def do(self, key, fn):
        with self._lock:
            if key in self._inflight:
                future = self._inflight[key]
                fresh = False
            else:
                future = concurrent.futures.Future()
                self._inflight[key] = future
                fresh = True
        if fresh:
            try:
                result = fn()
                future.set_result(result)
            except Exception as e:
                future.set_exception(e)
            finally:
                with self._lock:
                    del self._inflight[key]
        return future.result()
The do method guarantees that for a given key, only one call to fn is in flight at a time. All concurrent callers for the same key block on the same Future. The result is shared, the backend is called once, and the latency seen by all callers is the backend latency of that one call.
This pattern is in the Go standard library (golang.org/x/sync/singleflight), in various Python libraries, and in Redis-based form via SETNX locks. It is the single most useful cache-adjacent primitive.
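To see the coalescing in action, a self-contained demo — a compact restatement of the same pattern plus a threaded harness; the 0.2 s backend sleep is arbitrary:

```python
import concurrent.futures
import threading
import time

class SingleFlight:
    # Compact restatement of the pattern, for a runnable demo.
    def __init__(self):
        self._inflight = {}
        self._lock = threading.Lock()

    def do(self, key, fn):
        with self._lock:
            fut = self._inflight.get(key)
            if fut is not None:
                fresh = False
            else:
                fut = concurrent.futures.Future()
                self._inflight[key] = fut
                fresh = True
        if fresh:
            try:
                fut.set_result(fn())
            except Exception as e:
                fut.set_exception(e)
            finally:
                with self._lock:
                    del self._inflight[key]
        return fut.result()

calls = []
sf = SingleFlight()

def slow_backend():
    calls.append(1)        # count real backend invocations
    time.sleep(0.2)        # pretend this is an expensive fetch
    return "value"

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(lambda _: sf.do("k", slow_backend), range(16)))

# 16 concurrent callers, (typically) one backend call, identical results.
```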
The edge cases to handle:
- The backend call fails. All waiters see the same error. Do they all retry? They shouldn’t stampede on retry either — use exponential backoff.
- The backend call is slow. All waiters wait. Timeouts on the wait itself are important.
- The cache was populated while the call was in flight. A waiter that lost the race for the lock should re-check the cache rather than blindly repeat the backend call — which is why lock-based variants re-read the cache on every retry.
For ML serving, singleflight is critical in front of model-loading operations (loading a model file into GPU memory is a minute-long operation; you don’t want ten concurrent requests to each load it) and in front of embedding lookups (computing an embedding is expensive; coalesce duplicate keys).
89.9 Cache patterns in ML serving systems
Applied to the ML stack specifically, the caches and their patterns:
Model metadata and routing. Read-heavy, rarely updated. Cache-aside with long TTL (minutes to hours) and explicit invalidation on model deploy. Shared Redis with 10-30 s local in-process LRU on top for hot keys.
Embedding cache. Query embeddings are expensive to compute (tens of ms on a TEI deployment) and many queries repeat exactly. Cache-aside with key = tokenized query hash, value = embedding vector, TTL = hours. Coalesce duplicates with singleflight.
Prefix KV cache. Discussed in Chapter 29. A specialized cache keyed by token sequence hash, storing precomputed K/V tensors per layer. It lives inside the inference server, not in Redis. LRU eviction because GPU memory is bounded.
Rate-limit buckets. High write rate, each request increments a counter. Write-behind pattern: atomic increments in Redis, periodic flush to persistent store for billing. Use INCR with EXPIRE to implement leaky-bucket or token-bucket algorithms.
Session/auth state. Read-heavy, writes on login/logout. Cache-aside with explicit delete on logout. TTL equals session timeout. Encrypt at rest because sessions are sensitive.
Model weight blob cache. A per-node local cache of model weights pulled from S3. Not in Redis — stored on local NVMe. Cache-aside on the first pull per node, with a custodial process to evict old models. The cold-start latency driver from Chapter 51.
Feature values for ranking. Read-heavy, written by a batch feature pipeline. Write-through or refresh-ahead to keep the online store in sync with the offline store. Feast and Tecton implement this as part of their architecture (Chapter 91).
User preferences and flags. Read-heavy, updated rarely. Cache-aside with explicit invalidation via pub/sub on update. Long TTL as a backstop.
The pattern is always: identify the read/write ratio, identify the staleness budget, pick the pattern that minimizes backend load while meeting the budget, and add stampede protection for the hot keys. This is not exotic work; it’s careful work.
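As one concrete instance of the catalog above, the rate-limit entry's fixed-window counter is a few lines. A sketch against a minimal in-memory stand-in for INCR/EXPIRE so it runs without a server; in real Redis the INCR-then-EXPIRE pair should be made atomic (a small Lua script via EVAL is the usual fix):

```python
import time

class FakeRedis:
    # Just enough of INCR/EXPIRE to run the sketch; swap in a real client
    # in production. The `now` parameters let tests control the clock.
    def __init__(self):
        self._data, self._expiry = {}, {}

    def incr(self, key, now=None):
        now = time.time() if now is None else now
        if self._expiry.get(key, float("inf")) <= now:
            self._data.pop(key, None)     # expired: start a fresh bucket
            self._expiry.pop(key, None)
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]

    def expire(self, key, seconds, now=None):
        now = time.time() if now is None else now
        self._expiry[key] = now + seconds

def allow(r, user, limit, window, now=None):
    # Fixed-window limiter: INCR the bucket, set its TTL when it's new.
    # In real Redis, do both atomically (Lua) so a crash between the two
    # calls can't leave an immortal counter.
    count = r.incr(f"rl:{user}", now=now)
    if count == 1:
        r.expire(f"rl:{user}", window, now=now)
    return count <= limit
```

The sliding-window variant replaces the counter with a ZADD/ZRANGE sorted set of request timestamps, trading memory for smoother enforcement.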
graph TD
Q1{Write rate?}
Q1 -->|"low — rarely updated"| Q2{Staleness OK?}
Q1 -->|"high — every request"| WB["Write-behind<br/>async flush"]
Q2 -->|"seconds–minutes"| CA["Cache-aside<br/>+ explicit invalidate"]
Q2 -->|"near-zero"| WT["Write-through<br/>cache is write path"]
Q2 -->|"don't care, just fast"| RA["Refresh-ahead<br/>background refresh"]
style WB fill:var(--fig-surface),stroke:var(--fig-border)
style CA fill:var(--fig-accent-soft),stroke:var(--fig-accent)
Picking the right caching pattern is a function of write rate and tolerable staleness — cache-aside covers the majority of cases; write-behind and write-through handle the extremes.
89.10 The mental model
Eight points to take into Chapter 90:
- A cache is a source of truth that lies. The discipline of caching is picking a pattern where the lie is acceptable.
- Cache-aside is the default. Application checks the cache, falls through to the backend, populates. Simple, robust, lazy.
- Write-through sacrifices write latency for tight consistency; write-behind sacrifices durability for write latency.
- Refresh-ahead prevents hot keys from ever missing by refreshing asynchronously before expiry.
- TTL is your freshness budget. Pick based on staleness tolerance. Add jitter to prevent synchronous expiry.
- Thundering herds kill caches. Protect hot keys with singleflight, early-refresh, or stale-while-revalidate.
- Singleflight coalesces duplicate in-flight calls so only one backend call is made per key regardless of concurrency.
- ML serving has a cache per layer: model metadata, embedding cache, KV cache, rate limits, session state. Each has its own pattern.
In Chapter 90 the focus turns to a different flavor of persistent storage: the lakehouse formats that make object storage behave like a database.
Read it yourself
- Martin Kleppmann, Designing Data-Intensive Applications, Chapter 5 on replication and the read/write consistency trade-offs that underlie every caching decision.
- Facebook’s paper Scaling Memcache at Facebook (Nishtala et al., NSDI 2013). The canonical paper on large-scale caching with stampede protection.
- The Redis documentation, especially on Redis Cluster, persistence, and replication.
- The Go golang.org/x/sync/singleflight source code — a clean reference implementation of the pattern.
- The “Optimal Probabilistic Cache Stampede Prevention” paper by Vattani, Chierichetti, and Lowenstein (VLDB 2015) for the xfetch formula.
- Marc Brooker’s blog posts on caching at AWS, especially on the bimodal failure mode of caches.
Practice
- Implement cache-aside for a user profile lookup with Redis. Include TTL, JSON serialization, and cache invalidation on write.
- Explain the difference between write-through and write-behind with a concrete example where each is appropriate and each is wrong.
- Compute the worst-case staleness window for a cache-aside pattern with a 300-second TTL and an explicit invalidate-on-write. What assumptions does your answer depend on?
- Implement a Redis-based singleflight using SETNX for a slow backend call. What happens if the lock holder crashes?
- Why does adding jitter to TTLs reduce stampedes? Walk through a scenario where it matters.
- Design the caching strategy for an embedding service that receives ~50% duplicate queries in its hot path. What’s in Redis, what’s in-process, what TTL, what stampede protection?
- Stretch: Build a tiny cache library in Python that supports cache-aside, TTL with jitter, singleflight, and refresh-ahead. Benchmark it against a no-cache baseline under a Zipf-distributed load and measure the cache hit ratio, backend QPS, and p99 latency.