Part VI · Distributed Systems & Request Lifecycle
Chapter 77

Backpressure, flow control, and queue theory

"A queue is a place where requests go to forget what they were supposed to do"

Rate limiting (Chapter 76) is the hard edge: above the limit, requests are rejected. Backpressure is the soft edge: as a system approaches saturation, it communicates that reality upstream so producers slow down before anything is rejected. Done well, backpressure keeps a system at high utilization without ever tipping into overload. Done badly — or not done at all — a system accumulates queues until latencies explode, memory runs out, and everything falls over at once.

This chapter covers the math (Little’s Law and why it matters), the anti-pattern (unbounded queues), the distinction between backpressure and load shedding, the canonical model (Reactive Streams), and the operational techniques that make flow control actually work. Every distributed system has queues. Queue theory tells you what they cost.

Outline:

  1. Every system is a queueing system.
  2. Little’s Law and what it actually says.
  3. The latency-utilization curve.
  4. Why unbounded queues are dangerous.
  5. Backpressure vs load shedding.
  6. The bufferbloat analogy.
  7. Reactive Streams as the canonical model.
  8. How backpressure propagates upstream.
  9. Techniques: bounded queues, token grants, pause/resume.
  10. Backpressure in ML serving specifically.
  11. Failure modes.
  12. The mental model.

77.1 Every system is a queueing system

A service receives requests at some rate, processes them with some capacity, and hands results back. If the arrival rate exceeds the processing rate, requests pile up — in a queue, a channel, a socket buffer, a thread pool’s blocking queue, somewhere. If the arrival rate is less than capacity, the server is idle some fraction of the time. If they match, the server runs at 100% utilization.

This is the object of queueing theory, and it is not an optional branch of math for distributed systems engineers. Every server, every message broker, every gRPC stream, every HTTP connection pool has a queue in it somewhere. Understanding the behavior of those queues is the difference between “the system gets slower under load” (expected) and “the system enters a death spiral at 80% utilization” (the thing queue theory predicts and naive engineers are surprised by).

The three quantities that describe a queue:

  • λ (lambda) — the arrival rate, in requests per second.
  • W — the average time a request spends in the system (queue + processing).
  • L — the average number of requests in the system at any moment.

And the single universal relationship:

L = λ × W

That is Little’s Law, and it is the closest thing to Newton’s laws queueing theory has.

77.2 Little’s Law and what it actually says

Little’s Law (John Little, 1961) says: in any stable queueing system, the average number of items in the system equals the average arrival rate times the average time each item spends in the system. It holds regardless of the arrival distribution, service distribution, number of servers, scheduling discipline, or anything else. It is almost eerily general.

The derivation is nearly trivial: over a long time window T, the number of arrivals is λT, and each spent an average of W seconds in the system. The total “customer-seconds” contributed to the queue is λT × W. Averaged over T, the average number in the system is (λT × W) / T = λW. That is the whole proof. The subtlety is that “in the system” and “arrival rate” must be defined consistently — if W counts only queue time, L counts only queued items; if W counts queue + service, L counts queued + in-progress items. Pick one and be consistent.

What Little’s Law lets you compute:

Given throughput and latency, compute in-flight count. A service handles λ = 1000 RPS with average W = 200 ms. Then L = 1000 × 0.2 = 200 in-flight requests on average. If you provisioned 100 concurrent worker slots, you are under-provisioned; if you provisioned 500, you are wasting capacity.

Given concurrency and latency, compute throughput. A service has L = 100 concurrent slots and W = 50 ms per request. Then λ = L / W = 100 / 0.05 = 2000 RPS. This is the ceiling; no amount of “optimization” to the scheduler can push you past it.

Given concurrency and throughput, compute latency. A service has L = 200 slots and handles λ = 500 RPS. Then W = L / λ = 200 / 500 = 400 ms. Every request takes 400 ms on average. If you thought it was 50 ms, the rest is queue time.

Applied to LLM serving: a vLLM replica running at its full concurrency of 64 while handling 10 requests per second implies W = 64 / 10 = 6.4 seconds per request on average. If your SLA is W < 3 seconds, you can sustain at most L = λ × W = 10 × 3 = 30 in-flight requests — well below the 64-slot ceiling, but going above 30 in-flight would violate the latency SLA. Little’s Law is the capacity-planning equation for any throughput-bound system.
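The three rearrangements above can be checked in a few lines; the function names here are ours, chosen for the example:

```python
# Little's Law: L = lambda * W. Given any two quantities, compute the third.
# The numbers below reproduce the worked examples in the text.

def in_flight(rps, latency_s):
    """L = lambda * W: average number of concurrent requests."""
    return rps * latency_s

def throughput(slots, latency_s):
    """lambda = L / W: the throughput ceiling for a fixed concurrency."""
    return slots / latency_s

def latency(slots, rps):
    """W = L / lambda: average time each request spends in the system."""
    return slots / rps

print(in_flight(1000, 0.200))   # 200 in-flight at 1000 RPS, 200 ms
print(throughput(100, 0.050))   # 2000 RPS ceiling with 100 slots, 50 ms each
print(latency(200, 500))        # 0.4 s average with 200 slots at 500 RPS
```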

77.3 The latency-utilization curve

Little’s Law is agnostic to utilization — it just relates L, λ, W. But queueing theory also tells you how W grows with utilization, and this is the part engineers need to internalize. For a simple M/M/1 queue (Poisson arrivals, exponential service time, one server), the expected time in system is:

W = 1 / (μ - λ)     where μ is the service rate

Let ρ = λ/μ be the utilization (fraction of capacity used). Then:

W = (1/μ) × 1/(1 - ρ)

The first factor is the raw service time. The second, 1/(1 - ρ), is the queueing multiplier. At ρ = 0.5 the multiplier is 2 — the average request spends as long waiting as being served. At ρ = 0.8 it is 5. At ρ = 0.9 it is 10. At ρ = 0.95 it is 20. At ρ = 0.99 it is 100.
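The multiplier values quoted above fall straight out of the formula:

```python
# M/M/1 time in system: W = (1/mu) * 1/(1 - rho). The second factor is
# the queueing multiplier; it is what explodes as utilization approaches 1.

def queueing_multiplier(rho):
    assert 0 <= rho < 1, "the queue is only stable for rho < 1"
    return 1.0 / (1.0 - rho)

for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"rho={rho:.2f}  multiplier={queueing_multiplier(rho):.0f}x")
```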

[Figure: latency multiplier 1/(1 - ρ) versus utilization ρ. The curve is flat at low utilization and becomes a vertical cliff above 80%; the target operating range sits well below the cliff zone.]
Operating at 90% utilization means a 10% traffic spike produces 10× latency; the safe operating range tops out around 70%.

This is the curve that kills systems. At low utilization it is flat; at high utilization it is a cliff. The region from 70% to 90% utilization is where W doubles and then triples. Above 90%, W grows without bound. The service is still processing requests — it hasn’t crashed — but the queue is growing faster than it drains, and every request added makes everything already in the queue slower.

Real systems are not pure M/M/1; they have multiple servers, different distributions, more complicated service. But the shape of the curve is universal: low utilization is cheap, high utilization has a cliff. The practical implication: target 60–70% average utilization, not 95%. You are paying for capacity you never use on average, but you are buying yourself a buffer for when demand spikes, and the alternative (running at 95% utilization) means a 10% traffic spike takes you into the asymptote and kills you.

This is also why tail latencies are worse than averages (Chapter 31). At 90% utilization, the average wait might be tolerable, but the tails — the p99, p999 — are much worse because they sit where the queue is deepest. A request that shows up during a momentary burst waits behind a transient backlog that the average does not see.

77.4 Why unbounded queues are dangerous

The most common mistake in system design: “I’ll just have a queue in front of the service to absorb bursts.” If the queue is unbounded, the burst absorber becomes a latency amplifier. The math:

  • The queue holds incoming requests while the service drains them at capacity.
  • If λ > μ even briefly, the queue grows.
  • If the queue is unbounded, it grows as long as λ > μ.
  • When λ drops back below μ, the backlog does not vanish; draining it takes queue_size / (μ - λ) seconds of below-capacity traffic to clear.

During the drain, every request sees the full queue as latency. A 10-second burst that fills a queue with 10,000 requests takes minutes to drain, and every request during those minutes sees 10 seconds of added latency. The client has long since timed out and retried; the retry adds to the queue; the system never recovers.
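A back-of-envelope sketch of that scenario, with assumed numbers (μ = 1000 req/s capacity, a 10-second burst at 2000 req/s, then 900 req/s afterward):

```python
# Unbounded-queue burst arithmetic. All rates and durations are assumed
# for illustration; the relationships are what matter.

mu = 1000.0          # service rate, req/s
burst_rate = 2000.0  # arrival rate during the burst
burst_secs = 10.0
post_rate = 900.0    # arrival rate after the burst (10% headroom)

backlog = (burst_rate - mu) * burst_secs  # requests queued during the burst
drain_secs = backlog / (mu - post_rate)   # time to clear at 100 req/s net drain
worst_wait = backlog / mu                 # queue delay seen at the back of the queue

print(backlog)      # 10000 requests queued
print(drain_secs)   # 100 seconds to drain
print(worst_wait)   # 10 seconds of added latency at the peak
```

If clients time out at 5 seconds, every one of those 10,000 queued requests is wasted work by the time it is served.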

This is the queue-induced latency collapse. It is what happens when you “buffer away the problem” instead of shedding load. The buffer turns an overload into a cascade.

The fix is bounded queues. Pick a queue size that you can actually drain in a time shorter than the client timeout. If clients time out after 5 seconds and the service processes 1000 req/s, the queue should be bounded at 5000 at most — beyond that, the requests in the queue will time out before they are served, and processing them is wasted work. Any request that arrives when the queue is full gets rejected immediately (with 429 or 503), which the client sees as a fast failure rather than a slow timeout.

The other fix is backpressure: tell the producer to slow down before the queue fills. If the queue is at 80% full, emit a signal that says “stop, I can’t take more right now.” The producer sees the signal, pauses its submissions, and the queue drains. No request is rejected; the producer just waits until the consumer is ready. This is what the next sections are about.

77.5 Backpressure vs load shedding

Two distinct responses to overload, often conflated.

Load shedding rejects excess load at the ingress. When the system is at or over capacity, incoming requests are refused with an error (429, 503). The client has to retry or give up. Load shedding preserves the system at the cost of some requests.

Backpressure tells producers to slow down, cooperatively. The producer observes a signal from the consumer (an explicit pause, a slower ack, a decreased credit window) and reduces its submission rate voluntarily. No requests are rejected; the producer’s pipeline just runs slower. Backpressure preserves both the system and the requests, at the cost of producer throughput.

[Figure: backpressure versus load shedding. Backpressure: the consumer sends a slow/pause signal to the producer; no data is lost, the producer slows. Load shedding: the gateway rejects the external request with 429/503; the client must retry.]
Use backpressure between trusted internal components where the producer can slow down cooperatively; use load shedding at the boundary with untrusted external traffic.

Both are necessary; they handle different situations.

Backpressure is appropriate when the producer can be trusted to slow down — usually because both producer and consumer are inside the same trust boundary (same team, same service mesh, same codebase). Producer-consumer pairs inside a single pipeline (stream processing stages, gRPC bidirectional streams, TCP) use backpressure. The connection between a Flink source and its downstream operator is backpressured; the connection between a gRPC client stream and a server handler is backpressured.

Load shedding is appropriate when the producer cannot be trusted, either because it is an external user (the internet) or because the pipeline is one-directional (no way to signal “slow down” upstream). API gateways shed load at the rate limit. Databases shed load by returning “too busy” errors when the connection pool is exhausted. An external HTTP API cannot backpressure browsers; it has to shed.

The right answer is usually both: backpressure between trusted components inside the system, load shedding at the boundary with untrusted outside traffic. A gateway rate-limits (sheds) external requests to stay within capacity, then backpressures internally across the service graph to keep the graph in steady state.

Interviewers test this by asking: “Your system is slowing down under load. Do you add a bigger queue?” The answer is no. You bound the queue, add backpressure upstream of it, and shed load at the boundary. A bigger queue makes things worse.

77.6 The bufferbloat analogy

Networking ran into this same problem in the 2000s and gave it a name: bufferbloat. Consumer routers were shipped with huge packet buffers, tens of megabytes deep, under the theory that bigger buffers absorb bigger bursts. In practice, the buffers became reservoirs of delayed packets. TCP congestion control, which relies on packet drops to infer congestion and slow down, stopped working — packets were never dropped, they were just buffered for seconds. Latency went through the roof; the network became unusable for anything interactive (VoIP, gaming, SSH) even while bulk transfers still worked.

The fix was AQM — Active Queue Management, algorithms like CoDel and PIE that drop or mark packets before the buffer fills, signaling congestion to TCP so it slows down. The key insight: the buffer’s job is not to absorb overload; it is to smooth out short-term variations. Anything longer than that should be dropped so the sender knows to back off.

Exactly the same insight applies to application queues. A queue that absorbs a 100-ms burst is helping. A queue that absorbs a 10-second burst is harming — those requests will sit in the queue and their clients will time out, and the clients will retry, and the retries will pile on. The mental model: queues are shock absorbers, not reservoirs. They exist to smooth milliseconds, not seconds. Bound them to that purpose.

The bufferbloat analogy is why engineers who understand networking understand application backpressure intuitively. The same math, the same failure mode, the same fix. If you have not read Jim Gettys’ bufferbloat writeups, they are a fantastic entry point to the entire topic.

77.7 Reactive Streams as the canonical model

Reactive Streams is a specification (2015, by Netflix, Pivotal, Red Hat, Twitter) for asynchronous stream processing with non-blocking backpressure. It is the canonical model — the interface that every backpressured streaming library implements, whether it is RxJava, Project Reactor, Akka Streams, or the Node.js stream module’s object mode.

The key interface is the Publisher-Subscriber pair:

[Figure: Reactive Streams demand-driven flow. The subscriber calls Subscription.request(N), meaning "I can accept N more"; the publisher sends at most N items via onNext(), then waits for further demand. The publisher MUST NOT send more items than requested; that is the entire backpressure contract.]
The subscriber drives the rate via request(N); a publisher that ignores demand and pushes freely defeats the backpressure and causes queue buildup.

interface Publisher<T> {
    void subscribe(Subscriber<? super T> s);
}

interface Subscriber<T> {
    void onSubscribe(Subscription s);
    void onNext(T t);
    void onError(Throwable t);
    void onComplete();
}

interface Subscription {
    void request(long n);    // ← the backpressure signal
    void cancel();
}

The crucial method is Subscription.request(n). The subscriber calls this to say “I can accept n more items.” The publisher must not send more items than the subscriber has requested. If the subscriber never requests more, the publisher stops. This is demand-driven flow: the consumer pulls work at its own pace, the producer only pushes when the consumer has signaled capacity.

The flow for a slow consumer:

  1. Subscriber subscribes and calls request(10) — “give me 10 items.”
  2. Publisher sends 10 items via onNext().
  3. Subscriber processes them. While it processes, no items flow.
  4. After processing, subscriber calls request(10) again.
  5. Publisher sends the next 10.

If the consumer is slow, the producer stalls after step 2, waiting for the next request(). The backpressure is automatic and cooperative; no queue grows without bound, no explicit “pause” signal is needed.
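The same demand-driven flow can be sketched in Python. This is an illustrative toy, not a spec-compliant implementation; the class and method names simply mirror the interfaces above:

```python
# Demand-driven flow in miniature: the publisher side (Subscription) never
# delivers more items than the subscriber has requested.

class Subscription:
    def __init__(self, source, subscriber):
        self.source = iter(source)
        self.subscriber = subscriber

    def request(self, n):
        # Deliver at most n items; never more than the subscriber asked for.
        for _ in range(n):
            try:
                self.subscriber.on_next(next(self.source))
            except StopIteration:
                self.subscriber.on_complete()
                return

class Subscriber:
    def __init__(self, batch):
        self.batch = batch
        self.received = []
        self.done = False

    def on_subscribe(self, subscription):
        subscription.request(self.batch)   # initial demand

    def on_next(self, item):
        # A real subscriber would process the item here, then request more.
        self.received.append(item)

    def on_complete(self):
        self.done = True

sub = Subscriber(batch=10)
subscription = Subscription(range(25), sub)
sub.on_subscribe(subscription)   # publisher delivers the first 10, then stops
subscription.request(10)         # demand signal: the next 10 flow
print(len(sub.received))         # 20: the last 5 wait until more is requested
```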

Reactive Streams is the reference implementation of backpressure. Every modern streaming system implements something equivalent. The JVM has it in java.util.concurrent.Flow. Node.js streams have .pause() / .resume(). gRPC bidirectional streams expose it via flow control. Kafka consumers expose it via poll() batches. The specific API varies; the model is the same: the consumer requests, the producer honors.

The implementations come with subtleties — a request(Long.MAX_VALUE) opts out of backpressure (which defeats the point, and should be a code smell), buffering strategies vary (drop oldest, drop newest, latest, none), and composing multiple publishers requires merge strategies that preserve backpressure. The documentation for Project Reactor and RxJava covers these well; it is worth reading once even if you are not a JVM developer.

77.8 How backpressure propagates upstream

A single pair of producer and consumer is trivially backpressured. The interesting case is a long pipeline: source → stage 1 → stage 2 → stage 3 → sink. If the sink slows down, the backpressure must propagate all the way back to the source, or intermediate stages will pile up.

The propagation happens transitively. The sink’s backpressure tells stage 3 to slow down. Stage 3’s input queue fills up, so stage 3 cannot accept more items. Stage 2 notices that stage 3 is not accepting, stops reading from its own input, and stalls. Stage 1 notices stage 2 is stalled and stops reading from the source. The source, unable to push into stage 1, stalls at the point where it pulls input from wherever it came from.

For this to work, every stage must honor the backpressure of its downstream. A stage that internally spawns an unbounded goroutine pool or a fire-and-forget worker breaks the chain — items land in the worker pool without being reflected in the stage’s visible backpressure, and upstream sees no signal even though downstream is drowning. The single most common bug in hand-rolled backpressure is “oh, this stage uses an unbounded channel internally.”

In gRPC streams, backpressure propagates via HTTP/2 flow control. Each stream has a flow-control window; when it fills, the sender stops. HTTP/2 automatically propagates window updates as the receiver drains. If a client cannot process fast enough, the server’s stream.Send call blocks until the window opens again. The application code doesn’t see it as backpressure explicitly — it sees its Send call taking longer — but the effect is correct. This is why gRPC streams compose backpressure naturally through long chains.

In Kafka, backpressure is different: the producer never blocks on the broker, but the consumer controls the pace via poll batch sizes. A slow consumer does not cause the producer to block — the producer keeps writing, and the consumer falls behind (lag grows). The “backpressure” is that the consumer lag signals to operators that they need to either speed up the consumer or slow down the producer. It is not automatic; it requires monitoring and manual adjustment. Kafka’s model is “infinite buffer,” which is great for decoupling but leaves the backpressure work to you.

In HTTP (non-streaming), there is no backpressure at the protocol level. Servers shed load with 429/503, clients retry. Backpressure in an HTTP service graph is emulated with concurrency limiters (bound the in-flight request count) plus circuit breakers (fail fast when a dependency is slow), which together have a similar effect: if a downstream is slow, your in-flight count climbs, your concurrency limiter rejects new requests, and the signal propagates upstream as 503s.

77.9 Techniques: bounded queues, token grants, pause/resume

Concrete techniques for implementing backpressure in code.

Bounded queues. The simplest primitive. A queue with capacity N; put blocks when full, take blocks when empty. Producers naturally slow down to consumer speed. In Go, chan T with a buffer size. In Java, ArrayBlockingQueue. In Python, asyncio.Queue(maxsize=N). When a put blocks, the producer is backpressured. The size N should be small (10–100 items), enough to smooth jitter but not enough to hide problems.
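A minimal runnable demonstration of the blocking-put behavior, using Python's standard queue module:

```python
# Bounded-queue backpressure in miniature: the producer can run as fast as
# it likes, but q.put() blocks once the buffer holds maxsize items, forcing
# the producer down to the consumer's pace. Nothing is lost, nothing grows.

import queue
import threading
import time

q = queue.Queue(maxsize=10)      # the bound IS the backpressure
consumed = []

def producer():
    for i in range(100):
        q.put(i)                 # blocks whenever the queue already holds 10
    q.put(None)                  # sentinel: no more items

def consumer():
    while True:
        item = q.get()
        if item is None:
            break
        time.sleep(0.001)        # the consumer is the bottleneck
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(consumed), q.qsize())  # 100 0: everything processed, nothing lost
```

Swap `queue.Queue(maxsize=10)` for an unbounded `queue.Queue()` and the producer finishes instantly while the backlog sits in memory: exactly the anti-pattern of 77.4.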

Token grants / credit systems. The consumer hands out tokens to producers; each submission consumes a token. When tokens run out, the producer waits. The consumer refills as it processes. This is Reactive Streams’ request(n) model. It is slightly more complex than a bounded queue but gives the consumer explicit control over when to admit more work — useful when the consumer wants to vary its pace based on internal state (e.g., downstream health).

Pause / resume. The consumer explicitly signals the producer to pause when overloaded and resume when recovered. Node.js streams expose this; TCP uses it at the transport layer. The downside is that the producer needs to trust the consumer’s pause signal and react promptly. Works well when producer and consumer are inside the same process or connected by a reliable channel.

Concurrency limiters. Not a queue, a gate. Cap the number of concurrent in-flight items at N. New items wait (or are rejected) until an in-flight item completes. Netflix’s concurrency-limits library implements this with adaptive limits (AIMD, Gradient) that tune N based on observed latency. Bounding concurrency is Little’s Law in reverse: you fix L and let λ fall out from W.

Semaphores. A semaphore with N permits. Each operation acquires a permit before starting, releases on completion. If no permits are available, the caller blocks. Identical to a concurrency limiter but lower-level; a semaphore is the primitive you build a concurrency limiter out of.
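A sketch of the semaphore-as-gate idea, measuring peak concurrency to confirm the cap holds (the handler and its simulated work are ours):

```python
# A semaphore caps in-flight work at N. Callers that arrive while all
# permits are held simply block: backpressure, not rejection.

import threading
import time

LIMIT = 4
sem = threading.Semaphore(LIMIT)
lock = threading.Lock()
active = 0
peak = 0

def handle(i):
    global active, peak
    with sem:                    # acquire blocks while LIMIT handlers run
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)         # simulated work
        with lock:
            active -= 1

threads = [threading.Thread(target=handle, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak)                      # never exceeds LIMIT
```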

Deadline-based shedding. Every request carries a deadline. When a worker picks up an item, it checks if the deadline has already passed; if yes, it drops the request immediately. This does not prevent queuing but ensures that queued work that has become pointless is skipped. Combined with bounded queues, it prevents “drain the backlog while every request in it times out anyway” scenarios.
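A sketch of the dequeue-time deadline check; the Request fields here are illustrative, not from any particular framework:

```python
# Deadline check at dequeue time: work whose client has already given up
# is dropped instead of processed.

import time
from dataclasses import dataclass

@dataclass
class Request:
    payload: str
    deadline: float              # absolute time after which a response is useless

def drain(backlog, process):
    while backlog:
        req = backlog.pop(0)
        if time.monotonic() >= req.deadline:
            continue             # shed: past deadline, processing is wasted work
        process(req)

served = []
now = time.monotonic()
backlog = [
    Request("stale", deadline=now - 1.0),   # client timed out long ago
    Request("fresh", deadline=now + 5.0),
]
drain(backlog, lambda r: served.append(r.payload))
print(served)                    # ['fresh']: the stale request was dropped
```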

Priority queues. Not a backpressure technique per se, but related. When queues build up, serve higher-priority requests first. Prevents important work from being stuck behind bulk work. Essential in multi-tenant systems where latency-sensitive calls should not wait on batch calls.

These techniques are composed in real systems. A typical gRPC server uses bounded channels for its request pipeline, a concurrency limiter at the edge of the server, a semaphore around any external call, and deadline-based shedding inside every handler. Backpressure is not one mechanism; it is the cumulative effect of several mechanisms applied at every stage.

77.10 Backpressure in ML serving specifically

LLM and ML serving has a twist on the general backpressure story: requests are not uniform. A short-prompt short-output request takes 200 ms. A long-prompt long-output request takes 10 seconds. A simple concurrency limiter sized for the short case underutilizes the GPU; sized for the long case lets the short case sit behind slow work. Pure Little’s Law math does not capture this variance.

The answer is the scheduler inside the serving engine (continuous batching, Chapter 23). vLLM and its peers do not use a bounded request queue at the API layer alone; they let the scheduler pull requests into an active batch based on live KV cache availability and pending decode capacity. The scheduler itself is the backpressure mechanism — if the KV cache is full, the scheduler refuses to admit new requests, and they queue at the API layer. The engine’s num_requests_waiting metric (Chapter 51) is the visible signal that backpressure is happening; the HPA/KEDA autoscaler reacts by adding replicas.

Practical implications:

  1. The API-layer queue should be small. The engine has its own scheduler; a large queue at the HTTP layer just adds latency without improving throughput. A queue of 5–20 requests is usually enough.
  2. The deadline must propagate. A request that has been waiting so long that its client has given up should be dropped, not decoded. vLLM supports abort for this; use it.
  3. Admission control at the gateway. Before a request even reaches the engine, check if the engine is already at num_requests_waiting > threshold and shed load (429) if so. This prevents the scheduler from being ambushed by a burst.
  4. Concurrency is not a scalar. A GPU has a KV cache budget; each request consumes a variable fraction of it. The right “concurrency limit” is expressed in KV-cache-blocks, not requests. The engine handles this internally; external limiters should not try to override it.
  5. Backpressure signals from the engine to the autoscaler. vLLM exports num_requests_waiting, which is effectively “how backpressured am I?” KEDA reads it and scales up. This is backpressure translated into horizontal scaling — a different flavor from the stream-processing model but the same insight.
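Point 3 can be sketched as follows; the threshold and the metrics accessor are hypothetical placeholders, not a vLLM or gateway API:

```python
# Gateway-side admission control: reject with 429 before the engine's own
# queue grows, instead of letting the scheduler absorb the whole burst.
# WAITING_THRESHOLD is an assumed value, tuned per deployment.

WAITING_THRESHOLD = 20

def admit(get_num_requests_waiting):
    """Return (HTTP status, message) for an incoming request, given a
    callable that reads the engine's current queue depth."""
    if get_num_requests_waiting() > WAITING_THRESHOLD:
        return 429, "engine backlogged; retry with backoff"
    return 200, "admitted"

# Simulated engine states:
print(admit(lambda: 5))          # (200, 'admitted')
print(admit(lambda: 50))         # (429, 'engine backlogged; retry with backoff')
```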

For embedding or reranker services, the story is simpler: each request is roughly uniform, so a simple bounded queue works. For LLM services, the scheduler is the backpressure, and your job is to feed it well and listen when it pushes back.

77.11 Failure modes

Hidden unbounded queues. Someone added a go func() with a channel but forgot to bound it. Or used asyncio.create_task without awaiting. Or put requests into an in-memory list. The queue grows until OOM. Fix: code review for unbounded data structures; lint rules; always bound channels/queues at creation.

The retry storm. Clients retry aggressively on 429. Every rejected request turns into 3 retries. The effective arrival rate triples when the system is already overloaded. Death spiral. Fix: exponential backoff with jitter, Retry-After honored, retry budgets.

Queue drained in wrong order. FIFO queues under heavy load mean the oldest (and nearest to timeout) requests are served first. By the time they are served, they have timed out. Meanwhile fresh requests at the back of the queue are starving. Fix: LIFO for latency-sensitive loads; or drop from the head of the queue when it fills.

Deadlines not propagated. An edge handler has a 5-second deadline but downstream services get the request with no deadline and work for minutes on a request that nobody is reading. Fix: pass deadlines as context; shed at each stage based on remaining time.

“Just add more memory.” The queue is filling up, so someone increases the queue size. The system handles the immediate burst but the latency is now twice as bad. Fix: understand the bufferbloat lesson. Bigger queues are worse, not better.

Misaligned timeouts. Client timeout is 10 s, server timeout is 30 s, downstream timeout is 60 s. The server keeps working on requests the client has given up on, wasting capacity. Fix: timeouts should decrease as you go down the stack, not increase.

Backpressure absorbed by intermediate layer. Gateway has its own buffer; buffer absorbs backpressure from downstream; producers never see the signal; eventually the gateway runs out of memory. Fix: every layer bounded, backpressure propagates end to end.

Autoscaler fights the limiter. Concurrency limiter throttles to 100 in-flight, autoscaler sees backpressure and scales out, now 100 replicas × 100 in-flight = 10,000 concurrent requests to the downstream, which melts. Fix: coordinate limits across autoscaling decisions; use adaptive concurrency that scales down as replicas scale up.

No visibility. Backpressure is happening somewhere in the pipeline, but you cannot tell where. Latency climbs, but every dashboard shows “normal.” Fix: instrument queue depths and rejection rates at every stage. A queue-depth metric with an alert on it is the first line of defense.

77.12 The mental model

Eight points to take into Chapter 78:

  1. Every system has queues; queueing theory is not optional. Little’s Law (L = λW) is the universal relationship between throughput, latency, and concurrency.
  2. Latency grows as 1/(1-ρ). The cliff from 80% to 95% utilization is not linear; plan for 60–70% average.
  3. Unbounded queues turn overloads into cascades. Bound every queue at a size that drains within the client timeout.
  4. Backpressure and load shedding are distinct. Backpressure slows producers cooperatively; load shedding rejects at the boundary. Use both.
  5. Bufferbloat is the networking name for the same anti-pattern. Queues are shock absorbers for milliseconds, not reservoirs for seconds.
  6. Reactive Streams is the canonical backpressure model. The consumer requests demand; the producer never pushes unrequested items.
  7. Backpressure must propagate end-to-end. One layer with an unbounded buffer breaks the chain for everything upstream of it.
  8. In LLM serving, the engine’s scheduler is the backpressure. Listen to its metrics, shed at the gateway, propagate deadlines, scale horizontally on the signal.

In Chapter 78 the question becomes: when retries or duplicates do happen, how does the system avoid applying the same side effect twice? Idempotency keys and the practical approximation of exactly-once.


Read it yourself

  • John Little, “A Proof for the Queuing Formula: L = λW” (1961) — the two-page paper that states the theorem.
  • Neil J. Gunther, The Practical Performance Analyst — a working engineer’s queueing theory book.
  • Jim Gettys et al., “Bufferbloat: Dark Buffers in the Internet” (ACM Queue, 2011). The canonical explanation of the networking version of the same problem.
  • The Reactive Streams specification (reactive-streams.org).
  • Netflix Tech Blog, “Performance Under Load” — the concurrency-limits library and adaptive concurrency.
  • Marc Brooker’s blog, especially “Exponential Backoff And Jitter” and queueing-related posts. One of the clearest voices on this topic.

Practice

  1. A service handles 500 RPS with average latency 80 ms. How many requests are in flight at steady state?
  2. You have 100 worker slots and each request takes 50 ms on average. What is the maximum sustainable throughput?
  3. A system hits 95% utilization. Latency is 10× baseline. Predict the latency at 99% utilization from the M/M/1 formula.
  4. A queue in front of your service is unbounded. Traffic exceeds capacity by 10% for 30 seconds, then returns to baseline. Describe what happens to latency during and after the burst.
  5. Explain the difference between backpressure and load shedding, and give one scenario where each is the right choice.
  6. Why is a request’s deadline important even when the system is backpressured? Construct a case where failing to check the deadline wastes capacity.
  7. Stretch: Write a small Go or Python program with a producer and a consumer connected by a bounded channel. Run the producer faster than the consumer and verify that the producer blocks at the correct rate. Then swap to an unbounded channel and watch memory grow.