Disaggregated prefill/decode: production reality with workload-dependent payoff
"If prefill is compute-bound and decode is memory-bound, why are we running them on the same GPU?" — turns out, sometimes you shouldn't be
In Chapter 21 we established the prefill/decode asymmetry: prefill is compute-bound at AI ≈ 700, decode is memory-bandwidth-bound at AI ≈ 1. In every chapter since then we’ve worked within the constraint of running both phases on the same GPU. This chapter covers what happens when you don’t.
Disaggregated serving — running prefill and decode on separate GPU pools, with KV cache transferred between them — was a research idea from 2023-2024 that has now become production reality, especially for vision-language workloads. The technique gives sometimes-large throughput and latency wins, but the payoff is highly workload-dependent, and that nuance is the main thing you need to understand.
Outline:
- The asymmetry, restated.
- The disaggregation idea.
- The production architecture — vLLM’s P2pNcclConnector, the proxy, service discovery.
- Operational gotchas — KServe, NCCL, K8s.
- KEDA scaling for disaggregated PD.
- The workload-dependent payoff: text vs VL vs batch.
- Real benchmark numbers.
- When to use disaggregation and when not.
- The future of disaggregated serving.
36.1 The asymmetry, restated
From Chapter 21:
- Prefill has arithmetic intensity ~700 FLOPs/byte. Compute-bound. Doubling FLOPs would double prefill throughput.
- Decode has arithmetic intensity ~1 FLOP/byte. Memory-bandwidth-bound. Doubling HBM bandwidth would double decode throughput.
These two phases have fundamentally different bottlenecks. When they run on the same GPU, each phase competes for the resources the other doesn’t need:
- Prefill needs FLOPs but relatively little HBM bandwidth (its big matmuls reuse each weight across many prompt tokens, so bytes moved per FLOP are low).
- Decode needs HBM bandwidth but uses very little compute.
A GPU that’s optimized for one phase is over-spec’d for the other. An H100 has 989 TFLOP/s and 3 TB/s of HBM. Decode uses 3 TB/s of bandwidth and ~30 TFLOP/s of compute (a tiny fraction of the available compute). Prefill uses ~989 TFLOP/s of compute and ~30 GB/s of bandwidth (a tiny fraction of the available bandwidth). Each phase wastes most of what the GPU offers.
If you could run prefill on hardware optimized for compute (lots of FLOPs, less HBM bandwidth) and decode on hardware optimized for bandwidth (lots of HBM, fewer FLOPs), each would be more efficient. This is the case for disaggregated serving.
Same idea but in latency terms: when a long prefill arrives at a co-located GPU, it blocks decode for the entire prefill duration. Even with chunked prefill (Chapter 23), each chunk of prefill takes some compute that decode would otherwise have. Decode latency suffers from prefill interference. Disaggregation eliminates this interference by physically separating the two phases.
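The roofline arithmetic behind this section fits in a few lines. A minimal sketch, using the H100 peak numbers quoted above (~989 TFLOP/s, ~3 TB/s); the exact ridge point shifts with datatype and clocks:

```python
# Back-of-envelope roofline check for the prefill/decode asymmetry,
# using the H100 numbers from the text.
PEAK_FLOPS = 989e12   # FLOP/s (dense BF16)
PEAK_BW = 3e12        # bytes/s (HBM)

# The ridge point: the arithmetic intensity at which the bottleneck flips
# from memory bandwidth to compute.
RIDGE = PEAK_FLOPS / PEAK_BW   # ~330 FLOPs/byte

def regime(ai):
    """Roofline regime for a phase with arithmetic intensity `ai`."""
    return "compute-bound" if ai >= RIDGE else "memory-bound"

print(f"ridge point ~{RIDGE:.0f} FLOPs/byte")
print("prefill (AI ~700):", regime(700))   # compute-bound
print("decode  (AI ~1):  ", regime(1))     # memory-bound
```

Prefill's AI of ~700 sits above the ridge (compute-bound); decode's AI of ~1 sits far below it (memory-bound), which is the asymmetry the rest of the chapter exploits.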
36.2 The disaggregation idea
The architecture:
- A pool of prefill workers, each running a copy of the model. These are sized for compute throughput.
- A pool of decode workers, each running a copy of the model. These are sized for HBM bandwidth.
- A router/proxy in front of both pools. New requests go to a prefill worker first, then their KV cache is transferred to a decode worker, where decode continues.
- A fast interconnect between prefill and decode workers (NVLink within a node, RDMA over InfiniBand across nodes). The KV cache transfer has to be fast, or it dominates the gain.
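How fast does that transfer need to be? A back-of-envelope sketch, assuming Llama-3-8B-like KV dimensions (32 layers, 8 KV heads, head dim 128, FP16) and nominal effective link rates — these are illustrative assumptions, not figures from this chapter:

```python
# Per-request KV cache size and transfer time under assumed model dims
# and link rates (all assumptions, not measurements).
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2

def kv_bytes(tokens):
    # K and V tensors, per layer, per token.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES * tokens

for tokens in (500, 2000, 8000):
    size = kv_bytes(tokens)
    t_nvlink = size / 450e9   # ~450 GB/s NVLink (assumed effective rate)
    t_pcie = size / 32e9      # ~32 GB/s PCIe Gen5 x16 (assumed effective rate)
    print(f"{tokens:>5} tokens: {size/1e6:7.1f} MB, "
          f"NVLink ~{t_nvlink*1e3:5.2f} ms, PCIe ~{t_pcie*1e3:6.2f} ms")
```

At 2K prompt tokens the cache is ~262 MB: well under a millisecond over NVLink, but several milliseconds over PCIe, which is why the interconnect requirement is not optional.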
The flow for a single request:
```mermaid
sequenceDiagram
    participant Client
    participant Proxy
    participant PrefillWorker as Prefill GPU
    participant DecodeWorker as Decode GPU
    Client->>Proxy: POST /v1/chat/completions
    Proxy->>PrefillWorker: request (max_tokens=1, encoded with decode addr)
    Note over PrefillWorker: Run prefill, build KV cache
    PrefillWorker-->>DecodeWorker: KV cache transfer (NCCL P2P / NVLink)
    PrefillWorker-->>Proxy: first token
    Proxy->>DecodeWorker: continue request
    loop autoregressive decode
        DecodeWorker-->>Proxy: next token (streaming)
    end
    Proxy-->>Client: SSE stream
```
Disaggregated serving: the KV cache is transferred once per request from the prefill GPU to the decode GPU via NCCL P2P, then decode runs entirely on the decode worker without ever blocking prefill.
- Request arrives at the proxy.
- Proxy assigns the request to a prefill worker.
- Prefill worker runs the prompt through the model with max_tokens = 1 (just to do the prefill and produce the first token). The KV cache for the prompt is now populated on the prefill worker.
- Prefill worker transfers the KV cache to a decode worker via NCCL/RDMA.
- Proxy reroutes the request to the decode worker for continuation.
- Decode worker generates the rest of the response autoregressively.
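The steps above can be sketched as code. This is a toy simulation, not the real proxy: the pool addresses are made up, and `send(addr, payload)` stands in for the HTTP/ZMQ transport the actual implementation uses:

```python
import itertools
import uuid

# Hypothetical worker pools; real addresses come from service discovery.
prefill_pool = itertools.cycle(["10.0.0.1:21001", "10.0.0.2:21001"])
decode_pool = itertools.cycle(["10.0.1.1:22001", "10.0.1.2:22001"])

def make_request_id(prefill_addr, decode_addr):
    # The request-ID routing-key pattern described in section 36.3.
    return (f"___prefill_addr_{prefill_addr}"
            f"___decode_addr_{decode_addr}_{uuid.uuid4()}")

def handle(prompt, send):
    """One request: prefill with max_tokens=1, then hand off to decode.

    `send(addr, payload)` is an injected transport stub returning tokens."""
    p, d = next(prefill_pool), next(decode_pool)
    rid = make_request_id(p, d)
    # Step 1: prefill emits the first token; the KV transfer to `d` happens
    # inside vLLM, triggered by the addresses encoded in the request ID.
    first = send(p, {"request_id": rid, "prompt": prompt, "max_tokens": 1})
    # Step 2: decode continues the same request on the decode worker.
    rest = send(d, {"request_id": rid, "prompt": prompt})
    return first + rest
```

The point of the sketch is the shape of the flow: one prefill call capped at one token, one decode call, and a request ID that carries the routing information between them.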
The key invariants:
- The model weights are loaded on both the prefill and decode workers. There’s no parameter sharing between them.
- The KV cache exists on the prefill worker briefly during prefill, then on the decode worker for the rest of the request.
- The transfer happens once per request (not per token), so its cost is amortized over the decode phase.
The architecture decouples the two phases. Each worker pool is sized independently:
- Prefill workers: optimized for FLOP throughput. Bigger compute, less HBM (you don’t need to hold many KV caches at once because you transfer them out quickly).
- Decode workers: optimized for HBM bandwidth. More HBM (to hold KV caches for many in-flight decoding requests), less compute.
In practice, both pools often use the same hardware (H100s) because that’s what’s available, but the configuration is different (more replicas of decode workers if your workload is decode-heavy, more prefill workers if it’s prefill-heavy).
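Independent sizing can be made concrete with a toy calculation. Assume each request costs `t_prefill` seconds of prefill-GPU time and `t_decode` seconds of decode-GPU time; the numbers below are illustrative, not measurements from this chapter:

```python
import math

# Toy pool-sizing calculation under assumed per-request GPU costs.
def workers_needed(rps, t_prefill, t_decode, util=0.7):
    """Workers per pool to sustain `rps` requests/s at target utilization."""
    return (math.ceil(rps * t_prefill / util),
            math.ceil(rps * t_decode / util))

# Chat-like: short prompts, long generations -> decode-heavy pool.
print(workers_needed(rps=10, t_prefill=0.2, t_decode=2.0))   # (3, 29)
# VL-like: image-heavy prefill -> prefill-heavy pool.
print(workers_needed(rps=10, t_prefill=1.5, t_decode=0.8))   # (22, 12)
```

The same request rate produces very different pool ratios depending on where the time goes, which is exactly the knob co-located serving does not have.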
36.3 The production architecture
Let me describe a production-quality disaggregated PD setup, drawing from real implementations in the open-source ecosystem.
vLLM’s P2pNcclConnector
vLLM has supported disaggregated serving since 2024 via the P2pNcclConnector. The component is a CUDA-side helper that uses NCCL P2P (peer-to-peer) primitives to send KV cache directly between GPUs without going through CPU memory.
The setup:
- The prefill vLLM instance is configured as a KV producer (writes KV cache out via NCCL).
- The decode vLLM instance is configured as a KV consumer (reads KV cache in via NCCL).
- A custom request ID encodes both the producer’s and consumer’s NCCL addresses, so the connector knows where to send the data.
The format of the request ID is something like:

`___prefill_addr_{zmq_addr}___decode_addr_{zmq_addr}_{uuid}`
Where the ZMQ addresses are the network addresses of the prefill and decode instances. When the prefill vLLM finishes the prefill, it parses the request ID, extracts the decode address, and sends the KV cache to that address via NCCL. The decode vLLM receives it and continues generation.
This is an unusual pattern — using the request ID as a routing key — but it works because vLLM’s request ID is opaque enough to encode this information without breaking other code paths.
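A decoder for this format might look like the following. This is a hypothetical sketch, not the actual connector code; it assumes addresses never contain the `___decode_addr_` marker or a trailing underscore:

```python
# Hypothetical parser for the request-ID routing key shown above.
def parse_request_id(rid):
    prefix = "___prefill_addr_"
    assert rid.startswith(prefix), "not a disaggregated request ID"
    rest = rid[len(prefix):]
    prefill_addr, rest = rest.split("___decode_addr_", 1)
    # The UUID is appended with a single underscore, so split from the right.
    decode_addr, _uuid = rest.rsplit("_", 1)
    return prefill_addr, decode_addr

rid = ("___prefill_addr_10.0.0.1:21001___decode_addr_10.0.1.1:22001"
       "_a1b2c3d4-0000-0000-0000-000000000000")
print(parse_request_id(rid))  # ('10.0.0.1:21001', '10.0.1.1:22001')
```

Everything the prefill instance needs to route the KV cache is recoverable from the ID alone, with no extra control-plane call.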
The proxy
A separate proxy component sits in front of both the prefill and decode pools. Its job:
- Receive client requests.
- Pick a prefill worker (round-robin or based on load).
- Pick a decode worker.
- Construct the special request ID with both addresses encoded.
- Send the request to the prefill worker first (with max_tokens = 1).
- After the prefill completes (and the KV transfer happens automatically), send the request to the decode worker for the rest.
- Stream the response back to the client.
The proxy is the orchestration brain. In production implementations, the proxy is often a Quart-based async Python server using ZMQ for service discovery — vLLM instances register themselves with the proxy on startup, and the proxy maintains a list of available prefill and decode workers.
Service discovery via ZMQ
When a vLLM instance starts up in disaggregated mode, it connects to the proxy’s ZMQ ROUTER socket and registers its NCCL address. The proxy maintains a directory of “all the prefill workers” and “all the decode workers” with their network addresses. When a new request comes in, the proxy picks workers from this directory.
This is operationally annoying — every component has to know about every other component — but it’s the simplest way to make the architecture work without a service mesh.
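The directory itself is simple. A minimal in-memory sketch, with the ZMQ transport elided; the registration message shape (`role`, `addr`) is an assumption, real payloads differ:

```python
import itertools

# Sketch of the proxy's worker directory (transport elided).
class WorkerDirectory:
    def __init__(self):
        self.pools = {"prefill": [], "decode": []}
        self._rr = {"prefill": itertools.count(), "decode": itertools.count()}

    def register(self, msg):
        """Called when a vLLM instance announces itself on startup."""
        pool = self.pools[msg["role"]]
        if msg["addr"] not in pool:
            pool.append(msg["addr"])

    def pick(self, role):
        """Round-robin over the registered workers of one role."""
        pool = self.pools[role]
        if not pool:
            raise RuntimeError(f"no {role} workers registered")
        return pool[next(self._rr[role]) % len(pool)]
```

What the sketch leaves out is the hard part operationally: detecting dead workers and removing them, which the real proxy must handle via heartbeats or connection state.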
36.4 Operational gotchas
If you set up disaggregated PD on Kubernetes, you’ll hit several gotchas. They’re all solvable, but they’re not obvious.
KServe doesn’t support multi-container InferenceServices well
The standard way to deploy LLM serving on K8s is via KServe’s InferenceService CRD, which wraps a Kubernetes Deployment with model-loading and autoscaling logic. KServe v0.16 panics on multi-container InferenceServices.
For disaggregated PD where prefill and decode need to coexist in the same pod (for GPU reasons — see next point), you have to bypass KServe and create the Deployment, Service, and ScaledObject manually as raw K8s resources. This works but loses some of KServe’s nice features.
The K8s NVIDIA device plugin isolates GPUs per container
The standard NVIDIA device plugin assigns GPUs to containers, and each container only sees its assigned GPUs. This breaks NCCL P2P because NCCL needs both GPUs visible in the same process namespace to set up P2P communication.
The fix: run prefill and decode as separate processes inside the same container. The container requests both GPUs, sees both, and runs two vLLM processes — one with CUDA_VISIBLE_DEVICES=0 for prefill, one with CUDA_VISIBLE_DEVICES=1 for decode. NCCL can now establish P2P between them because they’re in the same container.
This is unusual operationally (single container, multi-process) but it’s the only way to get NCCL P2P to work with the K8s device plugin.
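A sketch of what that pod spec looks like; the image name, launcher scripts, and labels are placeholders, not from this chapter:

```yaml
# Single-container, multi-process pattern (placeholder names throughout).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pd-disagg
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pd-disagg
  template:
    metadata:
      labels:
        app: pd-disagg
    spec:
      containers:
      - name: vllm-pd
        image: vllm/vllm-openai:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 2            # ONE container requests BOTH GPUs
        command: ["/bin/sh", "-c"]
        # Two vLLM processes in one container, so NCCL sees both devices
        # and can establish P2P between them.
        args:
          - |
            CUDA_VISIBLE_DEVICES=0 ./start-prefill.sh &
            CUDA_VISIBLE_DEVICES=1 ./start-decode.sh &
            wait
```

The critical lines are the single `nvidia.com/gpu: 2` request and the two backgrounded processes: splitting these into two containers would reintroduce the device-plugin isolation that breaks NCCL P2P.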
NCCL P2P needs NVLink, not network
NCCL P2P over PCIe is much slower than over NVLink. For the disaggregation to be worthwhile, the two GPUs must have an NVLink connection — which means they’re in the same node. Cross-node disaggregation needs RDMA, which is slower and more complex.
Set NCCL_IB_DISABLE=1 to force NCCL to use NVLink instead of falling back to InfiniBand for intra-node communication.
Ranking the gotchas
These are real operational complexities that have caught teams off-guard. The KServe issue alone has caused multiple production incidents. The single-container-multi-process pattern is non-obvious and brittle. The whole architecture demands more operational sophistication than co-located serving.
36.5 KEDA scaling for disaggregated PD
Scaling disaggregated PD with KEDA (Chapter 51) has its own twist.
The standard scaling metric for vLLM is vllm:num_requests_running, which counts the number of requests currently being processed. For a co-located vLLM, this is unambiguous: each request goes through prefill → decode on the same instance, so you count it once.
For disaggregated PD, the same request shows up on both the prefill and decode instances during its lifetime. If you naively scale on num_requests_running summed across both pools, you’d double-count requests, leading to over-scaling.
The fix: scale on the decode instance’s metric only. The decode instance is where the long-running work happens (generating tokens). The prefill instance’s work is brief and amortized. The number of requests in decode is the right signal for “are we under-provisioned?”
In a production deployment, the KEDA ScaledObject for the decode pool uses vllm:num_requests_running from the decode instance’s port (not the prefill’s), and the prefill pool is scaled independently based on prefill-specific metrics or just kept at a fixed ratio to the decode pool.
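A sketch of such a ScaledObject, assuming a Prometheus trigger; the resource names, label selector, serverAddress, and threshold are placeholders:

```yaml
# KEDA ScaledObject for the decode pool only (placeholder names).
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: pd-decode-scaler
spec:
  scaleTargetRef:
    name: pd-decode            # the decode Deployment, NOT the prefill one
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      # Query only the decode instances' metric to avoid double-counting
      # requests that appear on both pools during their lifetime.
      query: sum(vllm:num_requests_running{pool="decode"})
      threshold: "16"
```

The label filter in the query is doing the real work: without it, the same in-flight request would be counted on both its prefill and decode instance.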
36.6 The workload-dependent payoff
Here’s the part that the disaggregation papers don’t always emphasize: the payoff varies wildly by workload. For some workloads, disaggregation is a major win. For others, it’s a wash or even a loss.
The three workload categories:
Short text (< 500 input tokens)
For typical chat workloads with short prompts:
- Per-GPU throughput gain: ~0% (or slightly negative due to NCCL overhead).
- Latency reduction: ~40-55% (because prefill no longer blocks decode).
The throughput is unchanged because you’re using 2× the GPUs for similar total work. You get faster individual responses (better p50/p99 latency) but no efficiency improvement. Use disaggregation only if latency matters more than cost.
Long text (> 2K input tokens)
- Per-GPU throughput gain: ~10-30% (estimated, varies by model and length).
- Latency reduction: ~50-70%.
Longer prefills mean more interference in the co-located case. Disaggregation removes that interference, giving both better latency and modest throughput gains. Worth using for RAG-heavy or long-context workloads.
Vision-language (with images)
- Per-GPU throughput gain: ~30-50% for prefill-bound workloads.
- Latency reduction: ~50-70%.
VL workloads have huge prefills (each image is 200-2000+ tokens, see Chapter 32). The prefill phase is so expensive that it dominates the request, and disaggregation lets the prefill GPU stay continuously busy with images while the decode GPU stays continuously busy with token generation. Strong candidate for disaggregation.
Batch / offline
- Per-GPU throughput gain: negative.
- Latency: irrelevant.
For batch workloads where latency doesn’t matter, disaggregation is wasteful. You’re paying for 2× the GPUs to do the same total work, just with less interference. Just add more co-located replicas instead.
The summary table
| Workload | Per-GPU throughput | Latency | Recommendation |
|---|---|---|---|
| Short text (< 500 tokens) | ~0% | −40 to −55% | Use only if latency matters |
| Long text (> 2K tokens) | +10 to +30% | −50 to −70% | PD worthwhile |
| Vision-language | +30 to +50% | −50 to −70% | Strong PD candidate |
| Batch / offline | Negative | N/A | Don’t use PD |
This table is the most important thing in this chapter. PD is most valuable when prefill is the bottleneck. If your workload is decode-heavy, just add more co-located replicas; you’ll get the same benefit cheaper.
36.7 Real benchmark numbers
A real-world benchmark from a production deployment using vLLM with disaggregated PD on Qwen3-VL-4B-Instruct (vision-language model), text-only mode, 256 max output tokens, 20 requests:
Throughput (tokens/second):
| Concurrency | Baseline 1 GPU | PD 2 GPUs | PD per-GPU | Per-GPU change |
|---|---|---|---|---|
| 1 | 67.2 | 164.1 | 82.1 | +22% |
| 4 | 282.2 | 593.6 | 296.8 | +5% |
| 8 | 462.8 | 825.4 | 412.7 | -11% |
Latency p50 (seconds):
| Concurrency | Baseline | PD | Change |
|---|---|---|---|
| 1 | 3.50 | 1.51 | -57% |
| 4 | 3.61 | 1.62 | -55% |
| 8 | 3.65 | 2.09 | -43% |
The pattern is clear: PD wins on latency at every concurrency level (43-57% reduction), but per-GPU throughput is roughly even with the baseline, turning negative at high concurrency, where the NCCL overhead dominates.
For text-only on a small VL model, this means PD is a latency optimization, not a throughput optimization. But the same model on actual VL workloads (with images) shows much bigger gains because the per-image prefill is so expensive. The throughput-vs-baseline ratio inverts.
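The per-GPU change column in the throughput table is just the PD total divided across its two GPUs, compared against the single-GPU baseline:

```python
# Recomputing the per-GPU change column from the throughput table above.
rows = {1: (67.2, 164.1), 4: (282.2, 593.6), 8: (462.8, 825.4)}
for conc, (baseline_1gpu, pd_2gpu) in rows.items():
    per_gpu = pd_2gpu / 2           # PD uses 2 GPUs
    change = per_gpu / baseline_1gpu - 1
    print(f"concurrency {conc}: PD per-GPU {per_gpu:.1f} tok/s ({change:+.0%})")
```

This normalization is the honest way to read any disaggregation benchmark: raw PD throughput always looks good because it uses twice the hardware.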
36.8 When to use disaggregation
The decision tree:
Use disaggregated PD when:
- Latency SLAs matter (TTFT or p99 latency) and you can pay for 2× the GPUs.
- The workload is prefill-bound: VL with images, long documents, RAG with large contexts, mixed workload where long prefills would block short requests.
- You have fast GPU interconnect (NVLink within a node, or RDMA across nodes).
- You have the operational sophistication to deploy and monitor a multi-component architecture.
Don’t use disaggregated PD when:
- The workload is decode-bound (short prompts, long generations).
- You’re optimizing for cost at fixed quality (just add co-located replicas).
- You’re on hardware without fast GPU interconnect.
- You don’t have the operational team to manage the additional complexity.
For most production deployments today, the right answer is co-located PD with chunked prefill (Chapter 23). Disaggregation is for specific high-value cases.
36.9 The future of disaggregated serving
Where the field is going:
- More frontier labs are adopting disaggregation for their largest models. The compute and memory savings are real at frontier scale.
- vLLM’s disaggregation support is maturing, with cleaner APIs and better operational tooling.
- SGLang has its own disaggregation mode with similar trade-offs.
- NIXL (NVIDIA Inference Xfer Library) is NVIDIA’s KV transfer library for disaggregation, designed to handle the cross-node case efficiently.
- Hardware-specific optimizations — Hopper and Blackwell have features (TMA, asynchronous matmul) that disaggregation can exploit.
The long-term direction: disaggregation will become the default for prefill-heavy workloads (VL, long-context, RAG-heavy), while co-located serving remains the default for short-context chat. The line between “needs disaggregation” and “doesn’t need it” will move as the techniques mature.
36.10 The mental model
Eight points to take into Chapter 37:
- Prefill and decode are different workloads with different bottlenecks. Co-locating them wastes hardware.
- Disaggregated serving runs prefill and decode on separate GPU pools with KV cache transfer between them.
- Production implementation uses NCCL P2P over NVLink for fast KV transfer, with a proxy for routing.
- K8s gotchas — KServe doesn’t support multi-container InferenceServices cleanly; the device plugin forces single-container-multi-process.
- KEDA scaling uses the decode instance’s metric only to avoid double-counting.
- The payoff is workload-dependent. Short text: latency only. Long text: throughput gains. VL: big throughput gains. Batch: don’t bother.
- Real benchmark: on a text-only workload with a small VL model, disaggregation cut latency 43-57% with roughly no per-GPU throughput change; prefill-bound workloads (VL with images) see ~30-50% throughput gains on top of the latency win.
- Use it for prefill-heavy, latency-critical workloads. Don’t use it for batch or pure decode-heavy workloads.
In Chapter 37 we look at the related technique of KV cache offload — moving cache to slower memory tiers when GPU memory is exhausted.
Read it yourself
- Zhong et al., DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving (2024). The seminal paper.
- Patel et al., Splitwise: Efficient Generative LLM Inference Using Phase Splitting (2024).
- The vLLM disaggregation documentation.
- The SGLang documentation on prefill-decode separation.
- The NCCL P2P documentation from NVIDIA.
- Real production write-ups on disaggregated PD architectures (search for “vLLM disaggregated” on GitHub for actively maintained examples).
Practice
- Why does disaggregation give bigger gains for VL workloads than for short-text workloads? Trace the prefill and decode costs in both cases.
- The K8s device plugin breaks NCCL P2P by isolating GPUs per container. Walk through why and explain the single-container fix.
- KEDA scaling for disaggregated PD uses the decode instance’s metric only. Why? What goes wrong if you sum across both pools?
- Compute the expected throughput improvement from disaggregating a VL workload where prefill takes 5 seconds per request and decode takes 1 second per request.
- Why is NVLink essential for disaggregated PD, and what happens if you only have PCIe?
- Read the DistServe paper’s experimental section. Identify the workloads where disaggregation wins biggest. Compare to the table in §36.6.
- Stretch: Set up a disaggregated PD configuration with vLLM on a multi-GPU machine (you can simulate the proxy). Measure latency and throughput vs co-located PD on a workload of your choice.
Concept check
- 1. What is the core inefficiency that disaggregated prefill-decode serving addresses?
- 2. After prefill completes on the prefill GPU pool, what must be transferred to the decode GPU pool, and why is this the main operational cost of disaggregation?
- 3. For which workload type does disaggregated serving provide the largest throughput benefit?
- 4. A team deploys disaggregated serving with KEDA autoscaling. During a traffic spike, the prefill pool scales up before the decode pool. What immediate symptom appears?