Part III · Inference Internals & Production Serving
Chapter 28 · Intermediate · ~17 min read

Tensor, pipeline, expert, and sequence parallelism for inference

"Training parallelism is for memory. Inference parallelism is for memory and latency."

We covered the four parallelism dimensions (DP, TP, PP, SP) for training in Chapter 12. The inference setting reuses the same machinery but changes the optimization target: training optimizes for total throughput (tokens/sec across the cluster), while inference optimizes for some weighted combination of per-request latency and total throughput.

This chapter covers how each parallelism dimension is used at inference time, what the trade-offs look like, and why the production answer is “TP within a node, PP across nodes when the model is too big for a node” — almost exactly the opposite of what you might guess if you’ve only thought about training.

Outline:

  1. The inference parallelism problem.
  2. Tensor parallelism for inference.
  3. Pipeline parallelism for inference.
  4. Expert parallelism (MoE).
  5. Sequence parallelism for inference.
  6. Data parallelism at inference time (replicas, not parallelism).
  7. The “fits in one GPU” question.
  8. Combining: TP × PP for big models.
  9. The production reality.

28.1 The inference parallelism problem

The two reasons to use parallelism at inference:

(1) Memory. A 70B model in bf16 doesn’t fit on a single 80 GB H100 (it needs 140 GB just for the weights). You need to split the model across multiple GPUs. This is the same as in training: parallelism for memory.

(2) Latency. Even if the model fits on one GPU, you might want to split it across multiple GPUs to process each request faster. Two GPUs computing one matmul together can do it in roughly half the time. This is not a goal of training (training cares about throughput, not single-batch latency), and it’s the main reason inference parallelism choices look different.

The third use case from training, throughput, is handled at inference by replication (data parallelism): run multiple independent copies of the model in parallel, and route requests to whichever copy is least loaded. This is operationally simpler than the parallelism within a single replica.

So inference parallelism is about:

  • Splitting one model across multiple GPUs (for memory and latency).
  • Replicating the resulting “shards” multiple times (for throughput).

The first is the interesting part. Let’s look at how each parallelism dimension serves it.

28.2 Tensor parallelism for inference

Tensor parallelism (TP) at inference time is structurally identical to training (Chapter 12): split each weight matrix across GPUs, run matmuls in parallel, all-reduce after each block.

The Megatron-style pattern from Chapter 12 still applies:

  • Column parallelism on the first matmul (QKV projection, FFN gate/up).
  • Row parallelism on the second matmul (attention output projection, FFN down).
  • One all-reduce per sub-block (one for attention, one for FFN).
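The column-then-row sharding can be checked numerically with a toy FFN — a minimal numpy sketch with TP=2, with the nonlinearity between the matmuls omitted so the partial sums are exact:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # toy hidden size
X = rng.standard_normal((1, d))        # one decode token
W1 = rng.standard_normal((d, 4 * d))   # "up" projection
W2 = rng.standard_normal((4 * d, d))   # "down" projection

# Reference: the unsharded computation.
Z_ref = X @ W1 @ W2

# TP=2: column-shard W1, row-shard W2; each "GPU" computes a partial output.
K = 2
cols = np.split(W1, K, axis=1)         # column-parallel first matmul
rows = np.split(W2, K, axis=0)         # row-parallel second matmul
partials = [(X @ cols[i]) @ rows[i] for i in range(K)]

# The all-reduce: sum the partial outputs from all GPUs.
Z = sum(partials)
assert np.allclose(Z, Z_ref)
```

The same identity is why one all-reduce per sub-block suffices: each GPU's partial product is a valid summand of the full matmul.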
Figure: Tensor parallelism splits weight columns, then rows, across GPUs and finishes with one all-reduce per sub-block (attention and FFN). With NVLink the decode all-reduce (~16 KB) is essentially free, giving near-linear latency reduction with TP degree.

For Llama 3 70B with TP=4 on a single node:

  • Each GPU holds 1/4 of the weights: ~35 GB per GPU.
  • Each block does two all-reduces over the 4 GPUs.
  • The all-reduce is over an (N, S, D) tensor — for N=1, S=1, D=8192 in decode, that’s only 16 KB. Negligible time over NVLink.
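The back-of-envelope version of that claim, counting bandwidth only:

```python
# Size of the decode all-reduce payload, and a pure-bandwidth time estimate.
N, S, D = 1, 1, 8192               # batch, seq (one decode step), hidden dim
msg_bytes = N * S * D * 2          # bf16: 2 bytes per element
print(msg_bytes)                   # 16384 bytes = 16 KB

t_us = msg_bytes / 900e9 * 1e6     # over ~900 GB/s NVLink
print(f"{t_us:.3f} us")            # ~0.02 us of transfer time
```

In practice the kernel-launch and synchronization latency of the collective (a few microseconds) dominates a transfer this small, but either way it is negligible against per-layer compute.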

The win: TP cuts decode latency roughly linearly with the number of GPUs, up to the point where the all-reduce cost starts to dominate. For a single user generating tokens, TP=4 is roughly 4× faster than TP=1 (modulo communication overhead). This is why TP is the default form of parallelism for inference of models that don’t fit on one GPU and for models that fit but you want lower latency.

The cost: TP requires fast intra-node interconnect. NVLink (~900 GB/s per GPU on H100 and H200) makes the all-reduces essentially free. PCIe (~32 GB/s) is too slow. Cross-node TP is generally a bad idea — the all-reduce over InfiniBand (~200 GB/s) takes long enough to hurt latency significantly.

The rule: TP within a node, only. A typical 8-GPU H100 node uses TP=8 for big models, TP=4 or TP=2 for medium models. Cross-node parallelism uses pipeline parallelism instead.

How TP affects the matmul shapes

A subtle point. Recall from Chapter 21 that decode is memory-bandwidth-bound because the matmul shapes are (1, d_in) → (1, d_out). With TP=K, the matmul becomes (1, d_in) → (1, d_out / K) per GPU.

The shape is even smaller per GPU. The arithmetic intensity is even lower. TP makes individual GPUs more memory-bound, not less. But the wall-clock per token decreases because all K GPUs work in parallel. The aggregate throughput is similar (you’ve split the work K ways), but the latency is ~1/K.

This is the right trade for latency-critical serving but not for throughput. For throughput, you’d rather have K independent replicas each serving different requests at full capacity.
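A quick check that TP leaves the per-GPU arithmetic intensity unchanged — counting weight traffic only (bf16) and ignoring the smaller activation traffic:

```python
# Per-GPU arithmetic intensity (FLOPs per byte) for a decode matmul
# (1, d_in) @ (d_in, d_out / tp) on each of the tp GPUs.
def decode_ai(d_in, d_out, tp, bytes_per_el=2):
    out = d_out // tp
    flops = 2 * d_in * out                    # one multiply-add per weight element
    weight_bytes = d_in * out * bytes_per_el  # dominant traffic: reading the weights
    return flops / weight_bytes

for tp in (1, 2, 4, 8):
    print(tp, decode_ai(4096, 4096, tp))      # AI is 1.0 regardless of TP degree
```

Both the FLOPs and the bytes scale down by 1/tp, so their ratio is constant: each GPU is exactly as memory-bound as before, it just has 1/tp as much memory traffic to get through.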

28.3 Pipeline parallelism for inference

Pipeline parallelism (PP) splits the model along the layer axis: GPUs 0..K-1 each hold a contiguous chunk of layers. For inference, this is much less attractive than for training, because:

(1) PP introduces pipeline-bubble latency. A new request has to traverse all K pipeline stages before producing its first token. Each stage is a separate forward pass, so the latency is K × per-stage time. Compared to TP, where the GPUs work in parallel, PP serializes the work — which is bad for single-request latency.

(2) PP requires micro-batching to fill the pipeline. Without many concurrent requests, the pipeline has bubbles (idle stages while one request travels through). With continuous batching across many requests, PP can be efficient, but only when there’s enough traffic to keep all stages busy.

(3) PP doesn’t save more memory per GPU than TP. Each PP stage holds 1/K of the weights plus the KV cache for the requests in flight on that stage — and the cache is partitioned along the layer axis (each stage holds K and V vectors only for its own layers), so the per-GPU savings are roughly 1/K, the same as TP. Memory is not a reason to prefer PP; tolerance for slow interconnect is.

Despite all this, PP is still useful for very large models that don’t fit in one node. An 8-H100 node has ~640 GB total.

Figure: Pipeline parallelism splits model layers across nodes (e.g., PP=4 stages of ~31 layers each, TP=8 inside every stage, 405B on 4 H100 nodes). Inter-node communication is only the small activation tensor at each stage boundary — a few KB per token in decode — so PP is viable over slower InfiniBand while intra-node all-reduces stay on fast NVLink.
A 405B model (810 GB in bf16) doesn't fit. The options are:
  • TP across multiple nodes. But cross-node TP over InfiniBand is slow.
  • PP across nodes. Each node holds a chunk of layers.

PP is the right answer for cross-node parallelism because it has much smaller communication than TP. Between PP stages, you only need to send the activations at the layer boundary — (N, S, D) per request. For decode, that’s a few KB. PP can run over slower interconnect without much penalty.
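A rough message-count comparison makes the point. This is a sketch under assumed dimensions (a Llama-3-405B-ish layer count) and it ignores message sizes and collective algorithms:

```python
# Why PP, not TP, across nodes: cross-node messages per decoded token.
layers = 126                 # assumed layer count for a 405B-class model
pp_stages = 4

tp_msgs = 2 * layers         # cross-node TP: two all-reduces per layer hit the fabric
pp_msgs = pp_stages - 1      # PP: only the stage boundaries cross nodes

print(tp_msgs, pp_msgs)      # 252 vs 3 cross-node messages per token
```

Two orders of magnitude fewer round-trips over the slow fabric, each carrying only a layer-boundary activation, is why the latency penalty of cross-node PP is tolerable and cross-node TP's is not.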

A typical 405B inference setup: TP=8 within each node, PP=4 across nodes. The model is split 32 ways total (4 PP stages × 8 TP within each stage). Each GPU holds 1/32 of the weights, ~25 GB.

28.4 Expert parallelism

Expert parallelism (EP) is specific to Mixture of Experts (MoE) models (Chapter 34). In MoE, each transformer block has multiple parallel “experts” (FFN sub-modules), and a routing network sends each token to a few experts (typically the top-2). The experts are independent.

EP splits the experts across GPUs: GPU 0 holds expert 0, GPU 1 holds expert 1, and so on. When a token is routed to expert i, it is sent to GPU i for processing. After the expert finishes, the result is sent back.

This is an all-to-all communication pattern: every GPU has tokens for every other GPU’s experts, and all the routing happens in one step. The all-to-all is more complex than the all-reduces of TP, but it’s natively supported by NCCL.
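A toy simulation of the dispatch/combine pattern — simplified to top-1 routing, with no capacity limits and each expert as a plain linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, E = 6, 4, 3                     # tokens, hidden dim, experts (one per "GPU")
x = rng.standard_normal((T, D))
router = rng.standard_normal((D, E))
experts = [rng.standard_normal((D, D)) for _ in range(E)]  # toy expert weights

# Routing: each token goes to its highest-scoring expert.
dest = (x @ router).argmax(axis=1)

# "All-to-all" dispatch: group tokens by destination expert, process them
# there, and scatter the results back to the original token order.
out = np.empty_like(x)
for e in range(E):
    mine = np.where(dest == e)[0]     # indices of tokens routed to expert e
    if mine.size:
        out[mine] = x[mine] @ experts[e]
```

The real all-to-all sends each group of tokens to the GPU holding its expert and a second all-to-all brings the results home; the fancy-indexed scatter above stands in for that combine step.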

EP shines for models like DeepSeek-V3 with 256 experts, where TP can’t usefully split a single expert (the experts are too small to benefit from intra-expert parallelism), but there are enough experts to spread across many GPUs.

DeepSeek-V3 production serving uses EP across multiple nodes. The all-to-all latency is the dominant cost; getting good performance requires fast network (InfiniBand RDMA at minimum) and careful scheduling to balance the load across experts.

For dense models (no experts), EP doesn’t apply.

28.5 Sequence parallelism for inference

Sequence parallelism (SP) splits the sequence dimension across GPUs. We covered it in Chapter 12 as a training technique to fit long-context activations into limited memory. For inference, the use case is similar: when the prompt is so long that the activations don’t fit on one GPU, split the sequence across GPUs.

For decode, sequence parallelism doesn’t apply because each step processes only one new token (the sequence dimension is 1). For prefill of very long prompts (>32k tokens), SP can reduce per-GPU memory pressure.

In production serving stacks today, SP is uncommon for inference. The more common approach is to chunk the prefill (Chapter 23) — process the long prompt in smaller chunks across multiple forward passes — instead of splitting it spatially across GPUs.

SP becomes important if you’re serving very long contexts (>128k tokens) and need to run prefill on a single request. The Ring Attention paper (Liu et al., 2023) is the canonical reference for distributing attention computation across the sequence axis at inference time.

28.6 Data parallelism: replicas, not parallelism

For inference, data parallelism doesn’t apply within a single request. Each request’s forward pass is its own thing; you can’t split it across DP replicas the way you can split a training batch.

What inference uses instead is replication: run K independent copies of the model on K separate GPU groups, and route incoming requests to whichever copy has capacity. This is operationally identical to running multiple instances of the same service behind a load balancer.

The math is straightforward: K replicas serve K times the requests per second of one replica. Throughput scales linearly with replicas; latency per request stays the same.

The interesting question is: what’s the optimal replica unit? Should each replica be a single GPU? A single TP=8 group on one node? A TP=8 × PP=4 cross-node group? The answer depends on:

  • Model size. Bigger models need more GPUs per replica.
  • Latency target. Tighter latency means more parallelism per replica.
  • Throughput target. Higher throughput means more replicas.
  • Hardware constraints. Can’t have a fractional GPU per replica.

The typical decision for a 70B model:

  • Latency-critical chat (300ms TPOT): TP=8 per replica, one node per replica. Lots of replicas.
  • Batched chat (1s TPOT acceptable): TP=4 per replica, two replicas per node. Higher per-token latency but more throughput per node.
  • Cost-optimized (latency doesn’t matter much): TP=2 per replica, four replicas per node. Highest throughput per node.
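The replica arithmetic can be captured in a few lines. The 800 ms TP=1 baseline is an assumed illustrative number, and the TPOT ∝ 1/TP model ignores communication overhead:

```python
# Replicas per 8-GPU node for a given TP degree, with a rough latency model.
GPUS_PER_NODE = 8

def plan(tp, base_tpot_ms=800):       # base_tpot_ms: assumed TP=1 decode latency
    replicas = GPUS_PER_NODE // tp    # each replica occupies tp GPUs
    tpot = base_tpot_ms / tp          # ideal linear latency scaling with TP
    return replicas, tpot

for tp in (8, 4, 2):
    replicas, tpot = plan(tp)
    print(f"TP={tp}: {replicas} replica(s)/node, ~{tpot:.0f} ms TPOT")
```

Per-node token throughput is roughly constant across the three rows; what moves is how that throughput is divided between fewer fast requests and more slow ones.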

Each choice trades latency for throughput per dollar.

28.7 The “fits in one GPU” question

The first question for any inference deployment: does the model fit on one GPU?

If yes, you have a choice. You can:

  • Run TP=1 (no parallelism) and replicate K times for throughput.
  • Run TP=K and serve each request faster with lower throughput per GPU.

The trade is latency vs throughput per GPU. With TP=1, each request gets one GPU and runs as fast as one GPU can. With TP=8, each request gets 8 GPUs and runs roughly 8× faster, but you can only run 1/8 as many requests in parallel. The total per-GPU throughput is similar in either case.

The choice is driven by your latency requirements. If you need 100ms TPOT, you need TP. If you can tolerate 300ms TPOT, TP=1 + replication is simpler.

For models that don’t fit on one GPU, you have no choice — you need TP at minimum to load the weights. The question is then how much TP: minimum is whatever fits the model; more is for latency.

For Llama 3 70B (140 GB in bf16): minimum TP is 2 on H100 (2 × 80 = 160 GB). For lower latency, TP=4 or TP=8 within a node.

For Llama 3 405B (810 GB in bf16): the minimum on H100 is 11 GPUs just to hold the weights, which doesn’t map cleanly onto 8-GPU nodes. Realistic options: TP=8 + PP=2 (16 GPUs, 2 nodes), or TP=4 + PP=4 (also 16 GPUs across 2 nodes), or some other combination. Frontier 405B serving typically uses TP=8 × PP=4 (4 nodes) or quantizes to FP8 and uses TP=8 on a single node.
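The minimum-GPU arithmetic, as a sketch — weights only, so real deployments need headroom on top for KV cache and activations:

```python
import math

def min_gpus(weight_gb, gpu_gb=80):
    # Minimum GPUs just to hold the weights on gpu_gb-sized devices.
    return math.ceil(weight_gb / gpu_gb)

print(min_gpus(140))   # Llama 3 70B, bf16   -> 2
print(min_gpus(810))   # Llama 3 405B, bf16  -> 11 (awkward on 8-GPU nodes)
print(min_gpus(405))   # 405B in FP8         -> 6 (fits one node with TP=8)
```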

28.8 Combining: TP × PP for big models

For models that need both intra-node and cross-node parallelism, the standard combination is TP within a node, PP across nodes. The reasons:

  • TP needs fast interconnect. NVLink within a node is fast enough; cross-node fabric usually isn’t.
  • PP communicates less. Layer-boundary activations are small; PP can run over slower interconnect without latency penalty.
  • PP introduces a fixed pipeline-bubble cost. This is small if there are enough concurrent requests in flight.

The configuration for 405B on 4 H100 nodes:

  • TP=8 within each node (the 8 GPUs each hold a slice of every layer in that node’s stage).
  • PP=4 across nodes (each node handles one quarter of the layers).
  • Total: 32 GPUs, ~25 GB per GPU.

The forward pass routes a request through the pipeline: PP stage 0 (8 GPUs running TP=8 on layers 0-30), then stages 1–3 in sequence. The activations between stages are tiny (a few KB per token in decode). The all-reduces within each stage are over NVLink and fast.

This is the production setup for the largest open dense models. Trillion-parameter MoE models push further still, with higher PP degrees and expert parallelism layered on top.

28.9 The production reality

A few summary points about production inference parallelism:

  • TP=8 within a node is the most common configuration for 70B+ models. NVLink-based, low-overhead, well-supported.
  • PP across nodes for the largest models. Adds latency but enables serving models that don’t fit in one node.
  • Replicas for throughput. Don’t try to scale a single replica beyond its sweet spot; add more replicas instead.
  • EP for MoE. Experts get their own GPUs; tokens are routed across the EP group.
  • Quantization first. Before reaching for more parallelism, quantize to FP8 or INT4. A 405B model in FP8 fits in TP=8 on a single node — much simpler than the bf16 cross-node setup.
  • vLLM, SGLang, TensorRT-LLM all support TP and PP via configuration flags. Setting tensor_parallel_size=8 and pipeline_parallel_size=2 is usually all you need at the framework level.
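As a concrete illustration, the 405B layout from §28.8 maps to a single vLLM launch command. Flag names are as in current vLLM releases, and multi-node serving also requires a Ray cluster spanning the nodes — check your version’s docs:

```shell
# Serve across 4 nodes: TP=8 inside each node, PP=4 across nodes.
# (Assumes a Ray cluster is already running across all four nodes.)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 4
```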

The complexity is in picking the right configuration for the workload, not in implementing the parallelism. The frameworks handle the actual collective communication for you.

28.10 The mental model

Eight points to take into Chapter 29:

  1. Inference parallelism trades latency vs throughput per GPU, in addition to memory.
  2. Tensor parallelism within a node is the default for any model that doesn’t fit on one GPU (or that needs lower latency).
  3. Pipeline parallelism across nodes for the very largest models. PP communicates less than TP, so it tolerates slower interconnect.
  4. Expert parallelism for MoE models. All-to-all routing of tokens across expert GPUs.
  5. Sequence parallelism is rare in production inference. Long-prompt prefill chunking is more common.
  6. Replication is “data parallelism” for inference. Run K independent replicas, route requests by load.
  7. Combine TP × PP for cross-node serving of frontier-scale models.
  8. Quantize first. A FP8 405B fits in TP=8 on one node; the bf16 405B needs cross-node parallelism. Quantization simplifies the parallelism story.

In Chapter 29 we look at the technique that exploits the same prefixes that prompts often share — prefix caching and radix attention.


Read it yourself

  • The Megatron-LM paper for the canonical TP description (Chapter 12 references).
  • The vLLM documentation on tensor parallelism and pipeline parallelism configuration.
  • The DeepSeek-V3 paper, section on MoE and EP serving.
  • Liu et al., Ring Attention with Blockwise Transformers for Near-Infinite Context (2023) — the canonical SP-for-inference reference.
  • The TensorRT-LLM documentation on multi-GPU and multi-node deployment.

Practice

  1. For Llama 3 70B in bf16, compute the minimum TP needed on H100 (80 GB per GPU). Now compute the minimum after quantizing to FP8.
  2. Why is cross-node TP a bad idea, but cross-node PP is acceptable? Compare the communication patterns and bandwidth needs.
  3. For a workload that needs 100ms TPOT and serves a 70B model, design the parallelism configuration on a single 8-H100 node. How many replicas can you fit?
  4. Explain why expert parallelism is the right tool for MoE models but not for dense models.
  5. In a TP=8 setup, the all-reduce after each transformer block is over an (N, S, D) tensor. For decode with N=1, S=1, D=8192, what’s the size? How long does it take over NVLink (~900 GB/s)?
  6. Why doesn’t TP make individual GPUs less memory-bound? Compute the per-GPU AI for (1, 4096) → (1, 4096) matmul split into TP=8.
  7. Stretch: Run vLLM with TP=2 and TP=4 on a 70B model on a multi-GPU machine. Measure TTFT and TPOT for both configurations. Verify the latency improvement matches the parallelism factor.

Concept check

  1. Why is tensor parallelism (TP) strongly preferred within a single node for inference, whereas pipeline parallelism (PP) is preferred across nodes?
  2. In the context of inference, data parallelism (replica-based) is used for which goal, and why is it simpler than intra-model parallelism?
  3. For a 70B dense model that does not fit on a single 80 GB H100, which combination of parallelism strategies is standard production practice and why?
  4. Sequence parallelism (SP) at inference time is most beneficial in which specific scenario?