Appendix D

Hardware reference

The GPU and networking specs that matter for ML systems work. Tables first, interpretation second, practical choices third. Numbers are approximate and drawn from vendor datasheets; always confirm against the current datasheet for your specific SKU before making a purchase decision. Cloud prices drift; interconnect roadmaps slip; the ranking between SKUs on any given workload depends on whether you’re compute-bound or memory-bound or interconnect-bound, which is the real point of this appendix.

Use this the way a chef uses a knife chart: to know what each tool is good for, not as a substitute for cooking.


D.1 NVIDIA training and inference GPUs

The dominant datacenter GPUs for LLM work, in rough chronological order. FLOPs are peak theoretical, dense unless noted. Real workloads see 30-70% of peak on attention-heavy models.

| GPU | Year | Arch | FP32 (TFLOPs) | BF16/FP16 dense (TFLOPs) | FP8 dense (TFLOPs) | FP4 dense (TFLOPs) | HBM size | HBM BW | NVLink BW | PCIe | TDP | Rough $ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| V100 (SXM2) | 2017 | Volta | 15.7 | 125 (tensor) | — | — | 32 GB | 0.9 TB/s | 300 GB/s | Gen 3 | 300 W | used ~$2k |
| A100 (40 GB) | 2020 | Ampere | 19.5 | 312 | — | — | 40 GB | 1.6 TB/s | 600 GB/s | Gen 4 | 400 W | ~$10k |
| A100 (80 GB) | 2021 | Ampere | 19.5 | 312 | — | — | 80 GB | 2.0 TB/s | 600 GB/s | Gen 4 | 400 W | ~$15k |
| H100 (SXM5) | 2022 | Hopper | 67 | 989 | 1979 | — | 80 GB | 3.35 TB/s | 900 GB/s | Gen 5 | 700 W | ~$30k |
| H100 (PCIe) | 2022 | Hopper | 51 | 756 | 1513 | — | 80 GB | 2.0 TB/s | 600 GB/s (NVL pairs) | Gen 5 | 350 W | ~$25k |
| H200 (SXM) | 2024 | Hopper | 67 | 989 | 1979 | — | 141 GB | 4.8 TB/s | 900 GB/s | Gen 5 | 700 W | ~$32k |
| B100 (SXM) | 2024 | Blackwell | — | 1800 | 3500 | 7000 | 192 GB | 8 TB/s | 1800 GB/s | Gen 5 | 700 W | preview |
| B200 (SXM) | 2024 | Blackwell | — | 2250 | 4500 | 9000 | 192 GB | 8 TB/s | 1800 GB/s | Gen 5 | 1000 W | ~$40-50k |
| GB200 (superchip) | 2024 | Blackwell | — | 5000 (2× B200) | 10000 | 20000 | 384 GB | 16 TB/s | 1800 GB/s per GPU | Gen 5 | 2700 W (pair + Grace) | ~$70-80k |
| GB200 NVL72 (rack) | 2025 | Blackwell | — | 180000 | 360000 | 720000 | 13.8 TB | — | 130 TB/s aggregate | Gen 5 | ~120 kW (rack) | ~$3M+ per rack |

Sparse FLOPs are conventionally 2× dense for NVIDIA’s 2:4 structured sparsity; mostly irrelevant for LLM inference workloads because activations are not consistently sparse. The B200/GB200 FLOP numbers include the fifth-generation Tensor Cores with FP4 and native FP6 support, plus the Blackwell “microtensor” scaling format.

D.1.1 AMD

AMD’s MI300 series is the only serious non-NVIDIA datacenter training and inference GPU. The appeal is more HBM per GPU and a fundamentally different software stack (ROCm) that is slowly catching up.

| GPU | Year | BF16/FP16 (TFLOPs) | FP8 (TFLOPs) | HBM size | HBM BW | Infinity Fabric BW | TDP | Rough $ |
|---|---|---|---|---|---|---|---|---|
| MI250X | 2021 | 383 | — | 128 GB | 3.2 TB/s | 400 GB/s | 560 W | ~$15k |
| MI300X | 2023 | 1300 | 2600 | 192 GB | 5.3 TB/s | 896 GB/s | 750 W | ~$20k |
| MI325X | 2024 | 1300 | 2600 | 256 GB | 6.0 TB/s | 896 GB/s | 1000 W | ~$22k |

MI300X has more HBM than H100 (192 GB vs 80 GB) at comparable BF16 throughput, which makes it attractive for large-model inference where HBM is the binding constraint. The catch is that ROCm tooling is still thinner than CUDA (vLLM and SGLang support MI300X; TensorRT-LLM does not; FlashAttention-3 does not; FP8 support varies).

D.1.2 Google TPU

TPUs exist. They’re excellent inside Google Cloud. They do not use CUDA, they use XLA, and the whole ecosystem is parallel to the NVIDIA world.

| TPU | Year | BF16 (TFLOPs) | HBM | HBM BW | ICI BW | Rough rental |
|---|---|---|---|---|---|---|
| v4 | 2021 | 275 | 32 GB | 1.2 TB/s | 540 GB/s | ~$3/chip-hour |
| v5p | 2023 | 459 | 95 GB | 2.8 TB/s | 4.8 TB/s | ~$5/chip-hour |
| v5e | 2023 | 197 | 16 GB | 0.8 TB/s | 400 GB/s | ~$1.5/chip-hour |
| Trillium (v6) | 2024 | 900 | 32 GB | 1.6 TB/s | 3.2 TB/s | ~$4/chip-hour |

TPU pods use a 3D torus topology via ICI (inter-chip interconnect), which is different from NVLink’s flat fabric and has different programming implications. You almost never touch TPUs directly unless you’re inside Google or specifically targeting JAX.


D.2 What each number actually constrains

This is the part most engineers miss. Peak FLOPs are a poster number; they’re rarely the binding constraint for LLM inference. Here’s the mental model.

Compute (FLOPs) binds prefill. When you process a long prompt, every token in the prompt runs through the same weight matrices in parallel. Arithmetic intensity is high (the model weights are read once and used for many tokens), so the GPU is compute-bound. This is why “tokens/sec prefill” on a given GPU tracks FLOPs roughly linearly, and why quantization (which cuts bytes-per-op but not FLOPs-per-op) helps decode more than prefill.

HBM bandwidth binds decode. One forward pass for one token reads the entire model’s weights from HBM (for a dense model) or the active experts’ weights (for an MoE). At batch size 1, the arithmetic intensity is ~1 op per byte, and you are 100% memory-bound. The ceiling on tokens/sec is:

max_tokens_per_sec ≈ HBM_bandwidth / model_size_bytes

For Llama 3 70B in BF16 (140 GB) on an H100 (3.35 TB/s), that’s 3350 / 140 ≈ 24 tokens/sec at batch 1. Observed numbers match this closely. Two H100s with TP=2 give roughly 2 × 3350 / 140 ≈ 48 tokens/sec, assuming the TP communication doesn’t bottleneck (which is why NVLink matters here — see below).
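The bandwidth ceiling above is easy to script. A minimal sketch using the appendix's approximate figures (H100 at 3.35 TB/s, 70B BF16 at 140 GB); it ignores TP communication overhead, so treat the TP>1 numbers as upper bounds:

```python
# Batch-1 decode ceiling: every generated token re-reads the full
# weight footprint from HBM, so tokens/sec <= bandwidth / model size.

def decode_tokens_per_sec(hbm_bw_gb_s: float, model_gb: float, tp: int = 1) -> float:
    """Upper bound on batch-1 decode throughput, ignoring TP comms overhead."""
    return tp * hbm_bw_gb_s / model_gb

h100_bw = 3350        # GB/s, H100 SXM (appendix figure)
llama70b_bf16 = 140   # GB of weights

print(decode_tokens_per_sec(h100_bw, llama70b_bf16))        # ~24 tok/s
print(decode_tokens_per_sec(h100_bw, llama70b_bf16, tp=2))  # ~48 tok/s
```

The same two-line function answers "what does FP8 buy me": halving `model_gb` doubles the ceiling without touching the hardware.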

The consequence: if you care about latency, HBM bandwidth matters more than FLOPs. H200 has the same compute as H100 but 43% more HBM bandwidth. For pure decode latency, H200 is ~40% faster; for prefill-heavy workloads, it’s a wash.

[Figure: roofline model for H100 BF16. The memory-bound region (sloped line, "reduce bytes") meets the compute-bound region (flat ceiling at 989 TFLOPs, "reduce FLOPs") at a ridge point near 295 FLOPs/byte. LLM decode sits far left of the ridge; long-prompt prefill sits to the right.]
LLM decode at batch 1 sits far left of the roofline ridge (a few FLOPs per byte vs ~295 for H100 BF16). It is always memory-bound, so HBM bandwidth is the lever, not FLOPs.

HBM size binds how big a model you can fit. Or how much KV cache you can hold, which is where the real fight lives. A 70B model takes ~140 GB at BF16, which fits on two H100s or one H200 (141 GB) or one MI300X (192 GB, with ~50 GB left for KV cache). Quantization to FP8 halves the weight footprint to ~70 GB, unlocking single-GPU serving.
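The fit arithmetic can be sketched directly. This uses the appendix's approximate numbers (70B at 140 GB BF16 or 70 GB FP8, KV cache at 320 KB/token) and assumes a flat 5 GB reserve for activations and runtime overhead; the reserve is an assumption, not a measured figure:

```python
# How many KV-cache tokens are left after the weights go on one GPU.

def kv_tokens_that_fit(hbm_gb: float, weights_gb: float, kv_kb_per_token: float,
                       reserve_gb: float = 5.0) -> int:
    """KV-cache capacity in tokens after weights and a fixed reserve."""
    free_gb = hbm_gb - weights_gb - reserve_gb
    if free_gb <= 0:
        return 0                                  # model doesn't fit at all
    return int(free_gb * 1e6 / kv_kb_per_token)   # GB -> KB, then / per-token

# 70B BF16 on one H100 (80 GB): doesn't fit.
print(kv_tokens_that_fit(80, 140, 320))   # 0
# 70B FP8 on one H200 (141 GB): roughly 200K tokens of KV cache.
print(kv_tokens_that_fit(141, 70, 320))
```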

Interconnect bandwidth binds tensor parallelism. TP splits each matmul across GPUs and requires an all-reduce after every layer (twice per layer, in fact: once after attention and once after the FFN). The all-reduce volume per token per layer scales with the hidden dimension. If your interconnect is too slow, TP stops scaling before you hit the FLOPs limit. Rule of thumb: NVLink (900 GB/s) is necessary for TP above 2 on large models; PCIe Gen 4 (32 GB/s per x16 link) is painful; PCIe Gen 5 (64 GB/s) is tolerable for small models.
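A sketch of the all-reduce volume, assuming a Llama-3-70B-like shape (hidden 8192, 80 layers; these shapes are illustrative assumptions) and a ring all-reduce, which moves about 2(N−1)/N times the payload per GPU:

```python
# Rough TP all-reduce traffic per decoded token: two all-reduces per
# layer, each over one hidden-dim activation vector.

def tp_allreduce_gb_per_token(hidden: int, layers: int, tp: int,
                              bytes_per_elem: int = 2) -> float:
    payload = hidden * bytes_per_elem        # one BF16 activation vector
    ring = 2 * (tp - 1) / tp                 # ring all-reduce traffic factor
    return 2 * layers * payload * ring / 1e9 # 2 all-reduces per layer, in GB

vol = tp_allreduce_gb_per_token(hidden=8192, layers=80, tp=2)
print(f"{vol * 1e3:.1f} MB per token")       # ~2.6 MB
```

A few MB per token is trivial over NVLink but becomes a real tax over PCIe once you multiply by hundreds of tokens per second per replica.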

Power and cooling bind density. A B200 SXM module is 1000 W; a GB200 rack draws ~120 kW. Most datacenters cap out at 30-40 kW per rack on air cooling and require liquid cooling above that. This is why GB200 NVL72 ships as a liquid-cooled rack from the factory: you can’t deploy it in a facility that doesn’t have direct-to-chip liquid cooling plumbing. The practical consequence for capacity planning: there is a huge difference between “we bought Blackwell” and “we can plug it in.”


D.3 Interconnects and networking

D.3.1 NVLink and NVSwitch

NVLink is NVIDIA’s GPU-to-GPU interconnect. Much faster than PCIe; the backbone of intra-node TP.

| Generation | Year | Per-link BW | GPU total BW | Used in |
|---|---|---|---|---|
| NVLink 2 | 2017 | 25 GB/s | 300 GB/s (6 links) | V100 |
| NVLink 3 | 2020 | 50 GB/s | 600 GB/s (12 links) | A100 |
| NVLink 4 | 2022 | 50 GB/s | 900 GB/s (18 links) | H100/H200 |
| NVLink 5 | 2024 | 100 GB/s | 1800 GB/s (18 links) | B100/B200/GB200 |

NVSwitch is the NVLink switch chip. In an 8-GPU HGX (DGX H100, HGX H100), four NVSwitch chips create a fully-connected fabric where any GPU can talk to any other GPU at full NVLink bandwidth. The GB200 NVL72 extends this to 72 GPUs in a rack via external NVSwitch trays, at 130 TB/s aggregate bandwidth — effectively making a 72-GPU system behave like one very large GPU for the purpose of TP and EP.

The big jump between H100 and NVL72 is the scale of the coherent NVLink domain. On an HGX H100, you can do TP=8 inside the box; for TP=16 you need two boxes connected by InfiniBand, which is much slower. On NVL72, you can do TP=72 (or TP=8 × EP=9, or whatever factorization you want) entirely over NVLink. This unlocks MoE models that don’t fit on a single box.

[Figure: three GPU connectivity tiers. (1) An 8-GPU HGX H100 node with a full NVSwitch fabric at 900 GB/s per GPU. (2) Two such nodes linked by InfiniBand NDR at 50 GB/s per port, NVLink inside each box only. (3) A GB200 NVL72 rack with 72 GPUs in a single NVLink domain at 130 TB/s aggregate, allowing TP=72 or mixed TP×EP factorizations.]
NVLink provides GPU-to-GPU bandwidth roughly 18× faster than InfiniBand NDR; the NVL72 rack extends the NVLink domain to 72 GPUs, enabling TP and EP combinations that previously required slower inter-node links.

D.3.2 PCIe

For CPU-GPU and slower GPU-GPU paths.

| Generation | Per-lane BW | x16 BW | Used for |
|---|---|---|---|
| PCIe 3 | 1 GB/s | 16 GB/s | Old servers |
| PCIe 4 | 2 GB/s | 32 GB/s | A100, some H100 |
| PCIe 5 | 4 GB/s | 64 GB/s | H100, H200, Grace Hopper |
| PCIe 6 | 8 GB/s | 128 GB/s | Coming; not yet widely deployed |

PCIe bandwidth is shared: CPU-to-GPU data transfer competes with storage, network, and NVMe traffic. When a GPU is loading a large model from disk at pod start, PCIe is usually the bottleneck, not the GPU or the storage.
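A quick bound on how long model loading pins the PCIe link, using the appendix's figures; real loads add deserialization and contention, so these are best-case numbers:

```python
# Lower bound on weight-load time: model bytes over PCIe x16 bandwidth.

def load_seconds(model_gb: float, pcie_gb_s: float) -> float:
    return model_gb / pcie_gb_s

# 70B BF16 (140 GB) over PCIe Gen 5 x16 (64 GB/s) vs Gen 4 x16 (32 GB/s):
print(load_seconds(140, 64))  # ~2.2 s best case
print(load_seconds(140, 32))  # ~4.4 s best case
```

If observed pod-start load times are 10× these bounds, the bottleneck has moved to storage or deserialization, not the bus.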

D.3.3 InfiniBand and RoCE

For GPU-to-GPU across nodes. The fabric that binds a multi-node training cluster together.

| Standard | Signaling rate (per lane) | Per-port BW (1x) | Per-port BW (4x) | Used for |
|---|---|---|---|---|
| EDR | 25 Gb/s | 3 GB/s | 12 GB/s | 2016 clusters |
| HDR | 50 Gb/s | 6 GB/s | 25 GB/s | 2019 clusters |
| NDR | 100 Gb/s | 12 GB/s | 50 GB/s | H100 clusters |
| XDR | 200 Gb/s | 25 GB/s | 100 GB/s | Blackwell clusters |

Most training clusters use one 4x NDR (400 Gb/s ≈ 50 GB/s) or 4x HDR (200 Gb/s ≈ 25 GB/s) per GPU, delivered via GPUDirect RDMA (NIC talks directly to GPU HBM, bypassing CPU). A large cluster has a non-blocking fat-tree or dragonfly topology so any-to-any communication is close to line rate.

RoCE (RDMA over Converged Ethernet) provides the same RDMA semantics over Ethernet. Comparable bandwidth, but it requires a lossless Ethernet configuration (PFC, i.e. Priority Flow Control, plus ECN), which is operationally harder. The choice is usually dictated by what the cloud or datacenter already has: Azure ND-series and Oracle OCI use InfiniBand; some Google and AWS clusters use RoCE.

Inference workloads rarely need multi-node interconnect at all, because inference usually fits in one node. The exception is disaggregated PD (Chapter 36), where the KV cache transfers from a prefill box to a decode box. There, NVLink within a box is preferred; if both sit in the same NVL72 domain, you get ~TB/s transfer speeds. If they're in separate boxes, you're limited to InfiniBand (~50 GB/s), which is fine for per-token KV volumes (hundreds of KB) but can queue under load.
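The handoff arithmetic is a one-liner. A sketch using the appendix's 320 KB/token KV figure for Llama 3 70B and the NVLink 4 vs IB NDR 4x link speeds (function name and prompt length are illustrative):

```python
# KV-cache handoff time for a whole prompt: tokens x bytes-per-token
# over the prefill-to-decode link.

def kv_transfer_ms(prompt_tokens: int, kv_kb_per_token: float,
                   link_gb_s: float) -> float:
    bytes_total = prompt_tokens * kv_kb_per_token * 1e3   # KB -> bytes
    return bytes_total / (link_gb_s * 1e9) * 1e3          # seconds -> ms

prompt = 4096
print(kv_transfer_ms(prompt, 320, 900))  # NVLink 4: ~1.5 ms
print(kv_transfer_ms(prompt, 320, 50))   # IB NDR:   ~26 ms
```

Both are small next to a multi-second generation, which is why IB is usually acceptable here; the risk is queueing when many handoffs share the link, not the single-transfer latency.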

D.3.4 Ethernet

For everything else (service-to-service, user traffic, object storage pulls): 10/25/100/200/400/800 GbE. A modern server has two 200 GbE ports to the TOR (top-of-rack) switch; a GPU training node has those plus 4-8 InfiniBand ports.


D.4 Topologies

Single-node training, 1-8 GPUs. HGX H100 or MI300X box. Tensor parallelism inside the box over NVLink. Nothing exotic. 80% of serious production LLM inference runs on this topology or smaller.

Multi-node training, 32-1024 GPUs. 4-128 DGX boxes connected via InfiniBand NDR. FSDP or Megatron-style TP+PP. All-reduce traffic dominates the network budget. Non-blocking fat-tree at 1:1 oversubscription.

NVL72 rack. 72 GPUs in one NVLink domain. Currently the largest coherent GPU cluster you can buy. Changes the tradeoff on TP-vs-PP because communication is essentially free inside a rack. Expected to become the default for training frontier models in the Blackwell generation.

Disaggregated PD pair. Two vLLM processes on the same box (or two boxes, one each with a prefill- and decode-optimized config), connected by NCCL P2P over NVLink or by RDMA over IB for inter-box. See Chapter 36 for the full details.

Inference edge. Single consumer GPU (RTX 4090, 5090) or CPU. llama.cpp plus GGUF-quantized weights. Nothing in this appendix except the FP4/INT4 quantization story applies here — see Chapter 44.


D.5 Picking hardware for a workload

Some real decisions framed as “what’s the binding constraint.”

Small 7B dense model, latency-sensitive chat. Binding constraint: decode latency, HBM bandwidth per GPU. Any single GPU with ≥ 16 GB HBM works. H100 gives the best tokens/sec per GPU; A100-80GB gives the best dollar-per-token; an RTX 6000 Ada or L40S works on a budget. If you care about both prefill (long system prompts) and decode, prefer H100; if you’re mostly decoding short turns, A100-80GB is 2× cheaper.

70B dense model, chat at moderate throughput. Binding constraint: HBM size for the model plus useful KV cache. One H200 (141 GB) fits an FP8 70B with plenty of KV cache. Two H100s (2 × 80 GB) with TP=2 work too, but add interconnect fragility and slightly worse latency. One MI300X fits the same model in BF16 under ROCm: cheaper per GB of HBM, but the software stack is thinner.

405B or 671B dense or MoE, offline batch. Binding constraint: fitting the model at all, plus enough KV cache for large batches. Minimum: 8 × H100 or 8 × MI300X. For DeepSeek-V3 671B, 8 × MI300X at FP8 fits comfortably with EP=8. H100 8-way works too but you’re quantizing aggressively (FP8 weights plus some offload). If you have NVL72, this is the workload it’s made for.

Vision-language model with 1000-token images. Binding constraint: prefill throughput. You want FLOPs, not HBM bandwidth. H100 over A100. B200 if available. Consider disaggregated PD (Chapter 36) to decouple prefill from decode — the payoff is large here, 30-50% per-GPU throughput gain.

Embedding / reranker service. Binding constraint: short-sequence throughput and model size (these models are small, 100-500M params). L40S or A10 are almost always the right answer — don’t pay for an H100. Optimize for QPS per dollar, not for tokens per second.

Pure batch throughput, dollars per million tokens. Binding constraint: $ / (GPU-hour) divided by throughput per GPU. A100-80GB at reserved or spot pricing usually wins for 7B-13B models. H100 on-demand is the worst choice. For 70B, H200 starts to win on $ / token because of the bandwidth. MI300X is competitive if your software stack supports it.

Frontier training run. Binding constraint: aggregate compute × time, with interconnect as the scaling limit. H100 cluster at 1024+ GPUs on InfiniBand NDR, or NVL72 racks if you can get them. The practical question is rarely “which GPU” but “how many can we get and when.”


D.6 Back-of-envelope numbers to memorize

For interviews and capacity planning:

  • H100 BF16: ~1 PFLOP dense, ~3.35 TB/s HBM. 80 GB.
  • H100 FP8: ~2 PFLOP dense.
  • A100: ~312 TFLOP BF16, ~2 TB/s HBM, 80 GB.
  • Llama 3 70B BF16: 140 GB of weights. KV cache: 320 KB/token.
  • Llama 3 8B BF16: 16 GB. KV cache: 128 KB/token.
  • Dense-model decode tokens/sec at batch 1: HBM_BW / model_size. ~24 tok/s for 70B on H100, ~200 tok/s for 8B on H100.
  • Prefill FLOPs needed per token: ~2 × params in the FLOPs path. For 70B dense, ~140 GFLOPs per prompt token (two FLOPs per parameter for the matmuls).
  • Batch size ceiling: usable_HBM_after_weights / (KV_per_token × sequence_length).
  • NVLink 4 BW: 900 GB/s per GPU. NVLink 5: 1.8 TB/s.
  • InfiniBand NDR 4x: 50 GB/s per port.
  • PCIe Gen 5 x16: 64 GB/s.
  • H100 on-demand cloud price: $3-5/hour. Reserved: ~half that.
  • Watts per rack on air cooling: 30-40 kW ceiling. Liquid cooling: 120+ kW.
[Figure: memory and interconnect bandwidth across generations. HBM bandwidth: V100 0.9, A100 2.0, H100 3.35, H200 4.8, MI300X 5.3, B200 8.0, GB200 16.0 TB/s. Interconnect: PCIe 3 16, PCIe 5 64, NVLink 2 300, NVLink 3 600, NVLink 4 900, NVLink 5 1800, IB NDR 50 GB/s.]
Each GPU generation roughly doubles HBM bandwidth and NVLink bandwidth; the gap between NVLink (1800 GB/s) and InfiniBand NDR (50 GB/s) explains why TP across nodes is so much more expensive than TP within a node.
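The batch-size-ceiling bullet above, as code. A sketch with the appendix's approximate figures; it ignores activation memory, so real ceilings come in a bit lower:

```python
# Concurrent sequences of a given length that fit in leftover HBM as
# KV cache: usable_HBM_after_weights / (KV_per_token x sequence_length).

def max_batch(hbm_gb: float, weights_gb: float, kv_kb_per_token: float,
              seq_len: int) -> int:
    usable_kb = (hbm_gb - weights_gb) * 1e6        # GB -> KB
    return int(usable_kb / (kv_kb_per_token * seq_len))

# 70B FP8 (70 GB) on H200 (141 GB), 320 KB/token KV, 4K-token sequences:
print(max_batch(141, 70, 320, 4096))  # ~54 concurrent sequences
```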

D.7 What changes generation to generation (pattern to expect)

Each generation roughly doubles FLOPs, doubles-ish HBM bandwidth, and increases HBM capacity by 25-50%. NVLink roughly doubles. The software rarely catches up fast enough to use the new capabilities on day one — FP8 on H100 took a year to become standard; FP4 on Blackwell will take the same.

The practical implication: don’t buy the newest generation if you can get the prior generation cheap. The A100 generation is still economically optimal for many production workloads as of 2026. H100 is the current workhorse. H200 wins for 70B-class decode. B200 / GB200 are worth it for training and for the largest MoE inference where HBM is the constraint — everything else is overpaying for peak FLOPs you can’t feed.

And always read the actual datasheet before a purchase. The numbers in this appendix are approximate and will be out of date the moment you read them.


D.8 The roofline model in one page

The roofline model is the clearest way to reason about what’s bottlenecking a kernel. Plot arithmetic intensity (FLOPs per byte) on the x-axis and achieved throughput (GFLOPs/sec) on the y-axis. The “roofline” has two segments:

  • A sloped line at low intensity: throughput = HBM_bandwidth × intensity. Memory-bound region.
  • A horizontal line at high intensity: throughput = peak_FLOPs. Compute-bound region.

The knee of the roofline, where the two regions meet, is at intensity = peak_FLOPs / HBM_bandwidth. For H100 BF16, that’s 989 TFLOPs / 3.35 TB/s ≈ 295 FLOPs/byte. For H100 FP8, 1979 / 3.35 ≈ 590 FLOPs/byte.
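The whole model fits in one function. A sketch with the H100 BF16 numbers from this section (989 TFLOPs peak, 3.35 TB/s HBM); the defaults are those appendix figures, not datasheet-exact values:

```python
# Roofline: attainable throughput is min(bandwidth x intensity, peak).

def roofline_tflops(intensity_flops_per_byte: float,
                    peak_tflops: float = 989, hbm_tb_s: float = 3.35) -> float:
    memory_bound = hbm_tb_s * intensity_flops_per_byte  # TB/s x FLOP/B = TFLOP/s
    return min(memory_bound, peak_tflops)

ridge = 989 / 3.35
print(f"ridge: {ridge:.0f} FLOPs/byte")  # ~295
print(roofline_tflops(2))                # decode-like intensity: ~6.7 TFLOPs
print(roofline_tflops(1000))             # long prefill: pinned at peak
```

Note how brutal the left side is: at decode-like intensity the same GPU delivers well under 1% of its peak FLOPs, which is the entire case for batching and quantization.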

What it tells you: if your kernel’s arithmetic intensity is below the knee, you’re memory-bound — adding more FLOPs won’t help; reducing bytes will. If you’re above, you’re compute-bound — reducing FLOPs will help; increasing bandwidth won’t.

LLM decode at batch 1: reads the entire model (140 GB for 70B in BF16) to produce one token. Intensity ≈ 1 FLOP/byte (one multiply and one accumulate, 2 FLOPs, per 2-byte weight). Way below the knee. Memory-bound. Always. The only way to raise intensity is to batch: if you process B tokens in one pass against the same weights, intensity scales roughly with B (up to the point where activations become the dominant cost). That's why decode throughput scales with batch size.

LLM prefill with a long prompt: S tokens in one pass. Intensity ≈ S FLOPs/byte in BF16 (2S FLOPs per 2-byte weight), which for S ≥ 300 is above the knee. Compute-bound. Adding more HBM bandwidth doesn't help; quantization (which reduces bytes but not ops) doesn't help as much.

Attention specifically: the attention matmul QKᵀ has intensity that scales with the head dimension d_h, not with sequence length directly, because each Q-row is reused across all K-rows for its dot products. But materializing the full attention matrix to HBM and then reading it back for the softmax hammers memory bandwidth — which is exactly what FlashAttention avoids by keeping the softmax computation in SRAM.

Use this mental model on any new kernel question. Knowing where you sit on the roofline tells you which optimizations pay off.


D.9 Concrete deployment sketches

Five realistic deployments with the hardware decisions shown.

Sketch 1: 7B chat service, 200 QPS, cost-sensitive. A100-80GB reserved at $1.50/hr. Single GPU per replica, no TP needed. vLLM with FP16 (no quantization — 7B fits easily, quantization wins less here). max_num_batched_tokens=8192. Continuous batching plus prefix caching. Expect ~150-250 tok/s decode at batch size ~20. At 200 QPS with average 2s latency, Little’s Law says 400 concurrent requests → ~20 replicas. Cost: $30/hr, or $720/day. Per million tokens: ~$0.15 if averaging 500 output tokens per request.
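Sketch 1's capacity math in code, via Little's Law (concurrency = arrival rate × latency). All inputs are the sketch's stated assumptions, not measurements:

```python
import math

# Replica count from Little's Law: L = lambda x W, divided by how many
# concurrent requests one replica can batch.

def replicas_needed(qps: float, avg_latency_s: float,
                    batch_per_replica: int) -> int:
    concurrent = qps * avg_latency_s              # L = lambda x W
    return math.ceil(concurrent / batch_per_replica)

print(replicas_needed(200, 2.0, 20))  # 20 replicas, matching Sketch 1
print(replicas_needed(500, 2.0, 30))  # ~34 (Sketch 2 rounds to ~35)
```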

Sketch 2: 70B chat service, 500 QPS, moderate tail-latency constraint. H200 single-GPU per replica (70B FP8 fits in 141 GB with plenty of KV cache). vLLM with FP8 via Transformer Engine. max_num_seqs=128, enable_prefix_caching=true. Expect ~80-100 tok/s per request at batch 30. Little’s Law: 500 × 2s = 1000 concurrent → ~35 replicas at ~30 req/replica. Cost: 35 × $5/hr ≈ $175/hr. Per million tokens: ~$1.

Sketch 3: VLM serving, 50 QPS, image-heavy workload. Disaggregated PD: 2 × H100 prefill + 4 × H100 decode per logical replica (the ratio reflects that VLM prefill dominates). Vision encoder (SigLIP) shared in the prefill process; LLM tensor-parallel across decode GPUs. NCCL P2P over NVLink between the two stages. Per-GPU throughput gain ~30% vs vanilla. Cost: same dollars, more latency budget. Expect ~40% latency reduction.

Sketch 4: Embedding service, 5000 QPS, short queries. L40S (48 GB, ~$1/hr). One GPU per replica. TEI runtime (not vLLM — different workload). Expect 10K-30K embeddings/sec per GPU for 256-token inputs. At 5000 QPS with small queries, one or two replicas handle the load. Cost: $2/hr. Trivial.

Sketch 5: 405B or 671B MoE, offline batch. 8 × MI300X (192 GB each, total 1.5 TB) with EP=8. FP8 weights. Can fit DeepSeek-V3 671B with room for big KV caches. For offline throughput (latency doesn’t matter), maximize batch size. Expect ~20K tokens/sec aggregate in best case. Cost: 8 × $5/hr ≈ $40/hr. Per million tokens: ~$0.50 if throughput holds.


D.10 A note on cloud instance naming

Knowing the instance families by hardware is half the battle when sizing workloads on public clouds.

AWS. p4d, p4de = A100. p5 = H100 (8-way, with EFA as the InfiniBand-equivalent); p5e and p5en = H200. p6 = B200 class (announced 2024). g5 = A10G and g6 = L4 for inference; g6e = L40S.

GCP. a2 = A100. a3 = H100. a3-mega = H100 with 8-way 800 Gb/s network. a3-highgpu-8g = 8×H100. TPU family is v4, v5p, v5e, v6e (Trillium).

Azure. ND A100 v4 = A100. ND H100 v5 = H100 8-way. ND H200 v5 = H200. ND MI300X v5 = MI300X 8-way.

Oracle Cloud. BM.GPU4.8 and BM.GPU.H100.8 are the bare-metal 8-way GPU SKUs with InfiniBand. OCI’s pricing is often cheaper for large reservations than AWS or GCP.

Lambda, CoreWeave, RunPod, Together, and Fireworks. Specialized GPU clouds with cheaper raw GPU rental ($1.5-$2.5/hr H100 reserved) but fewer managed services. Good for training and batch; harder for tight-SLA production serving without building your own stack.

Always check the actual interconnect: an “8× H100” instance can mean NVLink-connected (good for TP) or PCIe-only (bad for TP above 2). The top-of-line SKUs have full NVSwitch fabric; the cheaper ones have PCIe only.


D.11 Frequently missed gotchas

A grab bag of things that bite people who read only the top-line numbers.

Per-GPU HBM bandwidth doesn't improve with TP. Two GPUs give you 2× HBM capacity and 2× aggregate bandwidth for weight sharding, but each GPU still reads at the single-GPU rate. Decode throughput scales with TP width because each GPU reads only its shard, minus communication overhead; no single GPU got any faster.

PCIe is not NVLink. An “8-GPU server” with PCIe between GPUs is much slower for TP than an HGX with NVLink. Check the topology. For TP above 2, NVLink is effectively required.

Peak vs sustained FLOPs differ by 30-50%. The datasheet number assumes perfect kernel efficiency. Real matmul kernels hit 60-85% of peak on H100 BF16; attention kernels often lower. Sparse FLOPs assume 2:4 sparsity that LLM activations don’t have.

Power and cooling cap density. Just because a GPU draws 1000 W doesn’t mean your rack can hold 8 of them. Check the rack’s power and cooling budget before specifying.

Mixed-precision gotchas. FP8 training on H100 gives 2× throughput only for matmul; everything else (softmax, normalization, activation) still runs in BF16 or FP32. Real speedup is 1.3-1.6× on end-to-end training.

NUMA matters on multi-CPU boxes. On an 8-GPU box, each GPU is connected to a specific CPU socket. Cross-socket PCIe traffic is slower; pin workers to the correct NUMA node.

Driver versions. Every NVIDIA driver release has subtle performance differences. A kernel that hit 90% of peak on driver 535 might hit 85% on driver 545 until a fix lands. Pin the driver version for reproducibility.

Thermal throttling under load. A GPU advertised at 1 GHz clock may run at 900 MHz under sustained load in a hot rack. Peak throughput numbers assume datacenter cooling.

Consumer GPUs (4090, 5090) are not datacenter GPUs. Different precisions, no NVLink (removed in the 40-series), different driver paths, ECC not always on, no MIG. Use them for dev and edge inference, not production.


Use this appendix as a reference when you’re sizing a deployment, making a hardware-purchase recommendation, or trying to answer “why is our throughput so much worse than the datasheet says?” The answer is usually that the datasheet measures something that your kernel isn’t doing. Know which region of the roofline you’re in, pick the GPU whose binding constraint matches yours, and don’t chase FLOPs numbers that your workload can’t use.