Part X · ML System Design Interview Playbook
Chapter 115 · ~18 min read

Capacity planning math: the back-of-envelope kit

"A senior candidate doesn't compute the numbers. They recognize them."

Chapter 114 established the interview framework. The estimate phase — phase two of five — is the one that separates senior from mid-level most visibly. It is pure numerical fluency: can the candidate convert “1M users” into “300 H100s and $500k a month” in under two minutes, without a calculator, without hesitation? This chapter is the numerical kit that makes that possible.

The goal is not to memorize every number. The goal is to memorize the handful that everything else derives from, and to know the derivation paths well enough that the rest fall out. This chapter lists those anchor numbers, the derivation chains that use them, and the practice estimation problems that burn them into muscle memory. A candidate who drills this chapter for a week can walk into any senior ML systems interview and do capacity math faster than the interviewer can follow.

Outline:

  1. The anchor numbers you memorize.
  2. The latency table — L1 to inter-region.
  3. GPU specs that matter — H100, H200, B200.
  4. Bytes per op for each storage class.
  5. Tokens per second per GPU per model.
  6. Dollars per million tokens.
  7. Network throughput across the stack.
  8. Databases and IOPS.
  9. The derivation chains.
  10. Worked estimation problems.
  11. The mental model.

115.1 The anchor numbers you memorize

These are the seventeen numbers that every senior candidate has in working memory. Everything else in this chapter is derived from them.

| Quantity | Value |
| --- | --- |
| L1 cache hit | 1 ns |
| RAM access | 100 ns |
| SSD random read | 100 μs |
| HDD seek | 10 ms |
| Same-zone network round trip | 500 μs |
| Cross-region network round trip | 50–150 ms |
| H100 HBM bandwidth | 3.35 TB/s |
| H100 bf16 peak | ~989 TFLOP/s |
| H100 FP8 peak | ~2 PFLOP/s |
| H100 HBM capacity | 80 GB |
| H100 on-demand cloud price | $3–4/hr |
| H100 reserved cloud price | ~$2/hr |
| Llama 70B bf16 realistic decode throughput on 2× H100 (TP=2) | ~1200 tokens/s |
| Llama 8B bf16 realistic decode throughput on 1× H100 | ~5000 tokens/s |
| GPT-4o output price | ~$10 / M tokens |
| Self-hosted 70B cost | ~$0.50 / M tokens at good utilization |
| Seconds in a day | 86,400 |

Memorize these. They are the axes of every conversation. An interviewer who hears the candidate say “H100 is about 3 terabytes per second on HBM” relaxes, because it signals the candidate is operating at the right level.

A useful trick: seconds in a day is 86,400. Round to 100,000 for mental math. The ~16% error is irrelevant in an estimate phase where the answer is good to within 2×.

115.2 The latency table — L1 to inter-region

Jeff Dean’s “Numbers Every Programmer Should Know” (2012) is the canonical version. Updated for 2025 with NVMe, HBM, and cross-region reality:

| Operation | Latency | Relative to L1 |
| --- | --- | --- |
| L1 cache hit | 1 ns | 1× |
| Branch mispredict | 3 ns | 3× |
| L2 cache hit | 4 ns | 4× |
| L3 cache hit | 15 ns | 15× |
| Main memory (DRAM) | 100 ns | 100× |
| HBM access on H100 | ~150 ns | 150× |
| Compress 1 KB with zstd | 1 μs | 1000× |
| Send 1 KB over 10 Gbps | 1 μs | 1000× |
| SSD random 4 KB read | 100 μs | 100,000× |
| NVMe sequential 1 MB read | 200 μs | 200,000× |
| Same-AZ round trip | 500 μs | 500,000× |
| Same-region (different AZ) round trip | 1 ms | 1,000,000× |
| HDD seek | 10 ms | 10,000,000× |
| Inter-region round trip (same continent) | 30–80 ms | 30–80M× |
| Inter-region round trip (US–EU) | 80–100 ms | 80–100M× |
| Inter-region round trip (US–APAC) | 150–200 ms | 150–200M× |

Two patterns to internalize. First, RAM is 100× slower than L1, SSD is 1000× slower than RAM, HDD is 100× slower than SSD. Each layer is roughly two orders of magnitude worse.

[Figure: latency hierarchy — L1 cache 1 ns → RAM/HBM 100 ns → NVMe SSD 100 μs → same-AZ network 500 μs → HDD 10 ms → cross-region 100 ms.]
Each storage layer is ~100–1000× slower than the one above — knowing the crossover points tells you when a cache hit changes architecture decisions.
Second, **same-AZ network is comparable to SSD latency** — this is why distributed caches work, and why same-AZ microservices are not as expensive as naive intuition suggests.

For interview purposes, the two numbers to use constantly: 500 μs (same-AZ network), 100 ms (cross-region). When the interviewer asks “should this be in-process or cross-service?”, the answer depends on whether 500 μs × N fits the latency budget.

115.3 GPU specs that matter

The four GPUs a senior candidate has memorized for 2025–2026 interviews:

| GPU | HBM | Bandwidth | bf16 TFLOPs | FP8 TFLOPs | Launched |
| --- | --- | --- | --- | --- | --- |
| A100 80GB | 80 GB HBM2e | 2.0 TB/s | 312 | — | 2020 |
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 989 | 1979 | 2022 |
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 989 | 1979 | 2024 |
| B200 | 192 GB HBM3e | 8.0 TB/s | 2250 | 4500 | 2024–2025 |

The key insight interviewers probe for: H100 and H200 have the same compute. H200 is an H100 with more HBM capacity and more HBM bandwidth. For memory-bandwidth-bound workloads like decode, H200 is ~43% faster than H100 because bandwidth is 4.8/3.35 ≈ 1.43×. For compute-bound workloads like prefill or training, they are identical.

B200 roughly doubles both compute and bandwidth over H100 (8 TB/s vs 3.35 TB/s, 2250 vs 989 TFLOP/s). It is also ~2.5× more expensive. The dollars-per-token math almost always works out to a wash at current pricing — B200 wins on latency, not on cost efficiency (see Chapter 30).

For AMD MI300X: 192 GB HBM3, 5.3 TB/s bandwidth, similar bf16 TFLOPs to H100. Important to name because it is the main non-NVIDIA option, but as of 2026 software maturity lags. A senior candidate mentions MI300X as a tradeoff point, not a default.

The sanity check you should do in your head: the roofline of any dense LLM inference workload is HBM_bandwidth / model_size. For Llama 70B bf16 (140 GB) on H100 (3.35 TB/s), that’s 24 weight reads per second at batch=1, meaning 24 tokens per second per request at the memory limit. Batch of 64 pushes total throughput to ~1500 tokens/sec. The “realistic 1000 tokens/sec” number is this formula minus some efficiency loss.
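The same roofline arithmetic in a few lines — a sketch using the chapter's anchors (3.35 TB/s, 140 GB) with no efficiency fudge, so the batch-64 figure lands slightly above the "realistic" number:

```python
# Decode roofline: each generated token must stream every weight from HBM,
# so tokens/sec per request ≈ HBM bandwidth / model size; batching multiplies
# total throughput until compute or KV-cache capacity intervenes.

def decode_roofline(hbm_bw_tb_s: float, model_gb: float, batch: int = 1) -> float:
    """Upper-bound decode tokens/sec for a dense model (no efficiency loss)."""
    weight_reads_per_s = (hbm_bw_tb_s * 1e12) / (model_gb * 1e9)
    return weight_reads_per_s * batch

print(round(decode_roofline(3.35, 140)))            # Llama 70B bf16, batch=1: 24
print(round(decode_roofline(3.35, 140, batch=64)))  # batch=64: 1531 (~1500)
```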

115.4 Bytes per op for each storage class

Storage classes ranked by bytes-per-dollar and bytes-per-second:

| Class | Cost/GB/month | Read bandwidth (single device) | Typical size |
| --- | --- | --- | --- |
| HBM | — (on-GPU) | 3.35 TB/s (H100) | 80–192 GB |
| DRAM | ~$4 | 100 GB/s | 1 TB (server) |
| NVMe SSD | ~$0.10 | 7 GB/s | 30 TB (dense server) |
| SATA SSD | ~$0.05 | 500 MB/s | 4 TB |
| HDD | ~$0.02 | 200 MB/s | 20 TB |
| Object storage (S3) | ~$0.023 | 100 MB/s per connection, ~GB/s parallel | unbounded |
| Glacier | ~$0.004 | minutes to first byte | unbounded |

The two rules that follow:

HBM is ~30× faster than DRAM and ~500× faster than NVMe. When a KV cache gets evicted from HBM to DRAM, the fetch cost is tens of microseconds per block. When it gets evicted to NVMe, it’s tens of milliseconds. This is why KV cache offloading (Chapter 37) is a latency compromise, not a free win.

Object storage is cheap but slow per-connection. Pulling a 140 GB model weight file from S3 over a single connection takes ~1400 seconds at 100 MB/s. Pulling it in parallel with 64 connections takes ~20 seconds. The cold-start problem (Chapter 52) is entirely about which layer of this hierarchy the weights live on when a pod starts.
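The transfer math as a sketch, assuming ~100 MB/s per connection and near-linear scaling across parallel connections (real object-store scaling is not perfectly linear):

```python
# Time to pull model weights from object storage over N parallel connections.

def pull_seconds(size_gb: float, connections: int,
                 mb_s_per_connection: float = 100.0) -> float:
    return (size_gb * 1000) / (connections * mb_s_per_connection)

print(round(pull_seconds(140, 1)))   # single connection: 1400 s (~23 min)
print(round(pull_seconds(140, 64)))  # 64 connections: ~22 s
```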

For capacity planning: if the question is “how much data can we keep hot?”, the answer is bounded by DRAM cost. If it’s “how much can we keep warm?”, the answer is NVMe. If it’s “how much can we keep cold?”, the answer is S3 and the budget is almost unlimited.

115.5 Tokens per second per GPU per model

The serving-throughput table for bf16, realistic (not peak), assuming vLLM with continuous batching at healthy batch sizes:

| Model | GPU | Total tokens/s (realistic) | Batch size used |
| --- | --- | --- | --- |
| Llama 3 8B | 1× H100 | ~5000 | 64–128 |
| Llama 3 8B | 1× A100 80GB | ~3000 | 64 |
| Llama 3 70B | 2× H100 TP=2 | ~1200 total | 64 |
| Llama 3 70B | 1× H200 | ~1500 | 64 |
| Llama 3 405B | 8× H100 TP=8 | ~800 | 32 |
| Mixtral 8x7B | 2× H100 TP=2 | ~2000 | 64 |
| Qwen 2.5 72B | 2× H100 TP=2 | ~1200 | 64 |

With FP8 or INT4 quantization, multiply by 1.5–2.5×. With speculative decoding and good draft models, multiply by another 1.3–1.8×. With disaggregated prefill/decode on vision-heavy workloads (Chapter 36), add 30–50% throughput on the shared fleet.

The interview-relevant shortcuts:

  • 8B on 1× H100 ≈ 5k tokens/s. A small model “fills” an H100 at ~5000 tokens/sec.
  • 70B on 1× H100 (FP8) ≈ 1500 tokens/s. A medium model on one H100 with FP8.
  • 70B on 2× H100 bf16 ≈ 1200 tokens/s. The tensor-parallel version if you need bf16.
  • 400B on 8× H100 ≈ 800 tokens/s. The frontier-ish open model, needs 8 GPUs, hurts.

The last of these is the one candidates get wrong most often: frontier-scale models are slow per GPU because inter-GPU parallelism overhead eats much of the gains. Don't pitch 70B-style throughput for a 405B.

115.6 Dollars per million tokens

The dollar numbers interviewers expect candidates to cite confidently:

API prices (late 2025 / 2026, rough):

| Model | Input $/M | Output $/M |
| --- | --- | --- |
| GPT-4o | $2.50 | $10 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude Sonnet (latest) | $3 | $15 |
| Claude Haiku | $0.25 | $1.25 |
| Llama 70B (Together, Fireworks, DeepInfra) | $0.60–$0.90 | $0.60–$0.90 |
| Llama 8B hosted | $0.05–$0.20 | $0.05–$0.20 |
| Embedding (OpenAI text-embedding-3-small) | $0.02 | — |
| Embedding (OpenAI text-embedding-3-large) | $0.13 | — |
| Reranker (Cohere rerank) | ~$2 per 1k searches | — |

Self-hosted cost (output tokens, at good utilization):

| Config | $/M tokens |
| --- | --- |
| 8B bf16 on H100, 80% util | ~$0.11 |
| 70B bf16 on H100×2, 80% util | ~$0.58 |
| 70B FP8 on H100, 80% util | ~$0.25 |
| 70B INT4 on H100, 80% util | ~$0.18 |
| 405B bf16 on H100×8, 80% util | ~$2.00 |

The derived rules:

  • Self-hosted 70B is 20× cheaper than GPT-4o output per token.
  • Hosted 70B on Together/Fireworks is ~$0.80 per million. That’s the benchmark for “should I self-host or not?” Below ~100M tokens/month, use hosted. Above ~500M tokens/month, self-host.
  • Input tokens are 2–4× cheaper than output tokens on every API. This reflects prefill (compute-bound) being cheaper than decode (bandwidth-bound) per token, plus aggregate provider economics.
  • Embeddings are basically free. At $0.02 per million tokens, you can afford to embed everything. Tens of millions of documents at ~500 tokens each costs pennies.
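The self-hosted figures all come from one formula: GPU-dollars in, tokens out. A sketch using the chapter's $2/hr reserved anchor — which throughput and utilization you plug in are exactly the assumptions to state aloud:

```python
# Self-hosted cost per million output tokens.

def dollars_per_m_tokens(gpu_count: int, dollars_per_gpu_hr: float,
                         tokens_per_s: float, utilization: float) -> float:
    effective_tokens_per_hr = tokens_per_s * utilization * 3600
    return (gpu_count * dollars_per_gpu_hr) / effective_tokens_per_hr * 1e6

# 70B FP8 on 1× H100 at 1500 tok/s and 80% utilization:
print(round(dollars_per_m_tokens(1, 2.0, 1500, 0.8), 2))  # ~$0.46/M
```

With the more conservative throughputs of §115.5 this lands near the ~$0.50 anchor; more aggressive batching assumptions push it toward the table's figures.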

115.7 Network throughput across the stack

The throughput numbers across a typical data center stack:

| Link | Bandwidth |
| --- | --- |
| PCIe Gen4 x16 | 32 GB/s |
| PCIe Gen5 x16 | 64 GB/s |
| NVLink per GPU pair (H100) | 900 GB/s bidirectional |
| NVSwitch (H100 system, full 8-way) | 900 GB/s per GPU |
| 100 GbE NIC | 12.5 GB/s |
| 200 GbE NIC | 25 GB/s |
| 400 GbE NIC | 50 GB/s |
| InfiniBand HDR (200 Gbps) | 25 GB/s |
| InfiniBand NDR (400 Gbps) | 50 GB/s |
| Same-region inter-AZ bandwidth (cloud) | 10 Gbps typical |
| Cross-region bandwidth (cloud) | 1–10 Gbps typical |

The rules that matter:

  • NVLink is ~30× faster than PCIe Gen4 and ~70× faster than 100 GbE. Tensor parallel across NVLink is viable; tensor parallel across Ethernet is not.
  • Ethernet inter-node is the bottleneck for multi-node training. This is why InfiniBand dominates training clusters.
  • Same-region cross-AZ at 10 Gbps is ~1.25 GB/s. Pulling a 140 GB model across AZs takes ~2 minutes. This is why cold starts from S3 across AZs are painful.

For inference, the numbers most often used: 900 GB/s NVLink inside a node, 25 GB/s per NIC between nodes, 500 μs inter-node round trip at the application layer.

115.8 Databases and IOPS

The reference numbers for storage systems behind an ML platform:

| System | Read QPS (per node) | Write QPS | p99 latency |
| --- | --- | --- | --- |
| Redis (in-memory) | 100k+ | 100k+ | <1 ms |
| Memcached | 100k+ | 100k+ | <1 ms |
| PostgreSQL (point query on SSD) | 10k–50k | 5k–20k | 1–10 ms |
| MySQL (point query on SSD) | 10k–50k | 5k–20k | 1–10 ms |
| MongoDB | 10k–30k | 5k–15k | 1–10 ms |
| DynamoDB (per partition) | 3000 reads | 1000 writes | <10 ms |
| Cassandra | 10k–20k | 5k–10k | 1–50 ms |
| Elasticsearch (search) | 1k–5k | 1k–5k | 10–100 ms |
| Kafka (per broker) | 1M+ messages/s at small payload | — | 1–10 ms |
| HNSW vector index (FAISS) | 1k–10k searches/s per core | — | 1–10 ms |
| IVF vector index | 10k–100k/s per core | — | 1–5 ms |

Key interview facts:

  • Redis can do 100k QPS. That is the ceiling you assume for any cache layer.
  • Postgres can do 10k QPS for point queries. Range scans and joins are much slower; the 10k number is for primary-key lookups.
  • DynamoDB partitions cap at ~3000 reads/s. If your hot key exceeds this, you need to shard manually or pre-fan-out.
  • Kafka can absorb millions of messages per second per broker. That is why it is the backbone of every telemetry/metering pipeline.
  • A single-core HNSW index does 1–10k vector searches per second. Scaling is horizontal (sharding), not vertical.

115.9 The derivation chains

The four derivation chains every senior candidate can run cold, in under 90 seconds each.

Chain 1 — LLM serving from users to GPUs.

DAU → sessions/user/day → requests/session → tokens/request
  → total tokens/day → tokens/sec avg → tokens/sec peak (×3-5)
  → GPUs at peak (tokens/sec / per-GPU-throughput)
  → GPUs with headroom (×1.3)
  → $/month (GPUs × $2/hr × 24 × 30)

Chain 2 — KV cache memory budget.

n_layers × n_kv_heads × d_h × 2 (for K and V) × bytes_per_elem
  = per-token KV size
  × context_length_per_user = per-user cache size
  × concurrent_users = total cache requirement

Chain 3 — Vector index sizing.

num_documents × chunks_per_doc × tokens_per_chunk → total chunks
  × embedding_dim × 4 bytes (fp32) = raw vector bytes
  × 1.2 (HNSW overhead) = index size
  / per-node RAM budget = nodes required

Chain 4 — Storage cost for logs and traces.

QPS × bytes_per_record × 86400 s/day = bytes/day
  × retention_days = bytes at rest
  × $0.023/GB/month (S3) or $4/GB/month (DRAM cache)
  = $/month for storage
  + egress cost (rarely matters)

Master these four and nearly every estimation phase in every ML systems interview is a matter of plugging the anchor numbers from §115.1–§115.8 into one or two of them.

```mermaid
graph LR
  DAU["DAU"] --> SESS["× sessions/user/day"]
  SESS --> REQ["× requests/session"]
  REQ --> TOK["× tokens/request"]
  TOK --> TPDAY["tokens / day"]
  TPDAY --> TPSEC["÷ 86,400 → avg tok/sec"]
  TPSEC --> PEAK["× 3–5 → peak tok/sec"]
  PEAK --> GPUS["÷ per-GPU throughput → GPU count"]
  GPUS --> HEAD["× 1.3 headroom"]
  HEAD --> COST["× $/GPU-hr × 720 → $/month"]
  style PEAK fill:var(--fig-accent-soft),stroke:var(--fig-accent)
```

Chain 1 (LLM serving): the peak tokens-per-second node is the pivot — everything before it is user math, everything after is infrastructure math.
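Chain 1 as a function — a sketch with the chapter's defaults baked in (×4 peak factor, ×1.3 headroom, $2/GPU-hr reserved, 720 hours/month):

```python
# Chain 1: users → tokens/day → peak tokens/sec → GPUs → dollars/month.

def chain1(dau: float, sessions_per_day: float, requests_per_session: float,
           tokens_per_request: float, tok_s_per_gpu: float,
           peak_factor: float = 4.0, headroom: float = 1.3,
           dollars_per_gpu_hr: float = 2.0) -> tuple:
    tokens_per_day = dau * sessions_per_day * requests_per_session * tokens_per_request
    peak_tok_s = tokens_per_day / 86_400 * peak_factor
    gpus = peak_tok_s / tok_s_per_gpu * headroom
    monthly_dollars = gpus * dollars_per_gpu_hr * 720
    return round(gpus), round(monthly_dollars)

# Problem 1 in §115.10: 500k DAU, 5 turns/day, 600 tokens/turn, 500 tok/s/GPU.
print(chain1(500_000, 5, 1, 600, 500))  # (181, 260000)
```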

115.10 Worked estimation problems

Problem 1. Serve Llama 3 70B to 500k DAU, each doing 5 chat turns per day with ~600 tokens of output per turn. What’s the GPU count and monthly cost?

Walking it:

  • 500k × 5 × 600 = 1.5B tokens/day output.
  • 1.5B / 86,400 ≈ 17,400 tokens/sec average.
  • Peak (×4) ≈ 70,000 tokens/sec.
  • Per H100 (70B bf16 on 2× H100 TP=2): ~1200 tokens/sec per pair, so ~600 per GPU effective. Round to “500 tokens/sec per GPU” for safety.
  • 70,000 / 500 = 140 GPUs at peak.
  • With 30% headroom: ~180 GPUs.
  • 180 × $2/hr × 720 hr/month = ~$260k/month.
  • In $/M-token terms: $260k / (1.5B tokens/day × 30 days) ≈ $5.80/M tokens (high because we sized for peak, not average).

Problem 2. How much HBM does the KV cache consume for 200 concurrent Llama 70B requests, each at 8k context, in bf16?

  • Per-token cache: 2 × 80 layers × 8 kv-heads × 128 d_h × 2 bytes ≈ 320 KB.
  • Per request: 320 KB × 8192 ≈ 2.6 GB.
  • 200 × 2.6 = 520 GB.

This is across all tensor-parallel shards. On 4× H100, that’s 130 GB per GPU — impossible. On 8× H200 (141 GB each, 1128 GB total), it fits. The interviewer’s follow-up: “what would you do instead?” Answer: INT8 KV cache (halves it to 260 GB), GQA reduction (already at GQA 8), prefix caching (if system prompts are shared), shorter contexts.
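Chain 2, checked in code with the same Llama 70B shape used above (80 layers, 8 KV heads under GQA, head dim 128, bf16). The text's 520 GB comes from rounding the per-token size down to 320 KB before multiplying:

```python
# Chain 2: per-token KV bytes → per-request size → total across concurrency.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                       bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem  # ×2 for K and V

per_token = kv_bytes_per_token(80, 8, 128)   # 327,680 bytes ≈ 320 KB
per_request_gb = per_token * 8192 / 1e9      # ≈ 2.7 GB at 8k context
total_gb = per_request_gb * 200              # ≈ 537 GB for 200 requests
print(per_token, round(per_request_gb, 1), round(total_gb))  # 327680 2.7 537
```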

Problem 3. RAG index over 10 TB of text. How big is the vector index, and how many nodes to serve it from RAM?

  • 10 TB of raw text at ~4 bytes/token (~4 chars/token, ~1 byte/char) ≈ 2.5 trillion tokens. Far too many; "10 TB" must mean documents, not clean text. Reinterpret: 10 TB of documents at ~500 bytes per chunk → 20 billion chunks.
  • Reality check: that's still absurdly high. Re-interpret again: 10 TB of documents at 1 MB per doc → 10M docs → 100 chunks each → 1B chunks.
  • 1B chunks × 768 dim × 4 bytes = 3 TB raw vectors.
  • With PQ compression (×10 reduction): 300 GB.
  • Per-node RAM budget 200 GB → ~15 nodes for the raw vectors, ~2 nodes for the PQ-compressed index.
  • With replication ×3 for HA on the PQ index: ~6 nodes.

(The key lesson is that “10 TB” is ambiguous and the candidate should ask what the unit is. Chapter 117 makes this explicit.)
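Chain 3 for the 1B-chunk interpretation, as a sketch (768-dim fp32 vectors, ×10 PQ compression, 200 GB RAM per node, and ×3 replication are the assumptions from the walk above):

```python
import math

# Chain 3: chunks → raw vector bytes → PQ-compressed size → node count.

def index_nodes(n_chunks: float, dim: int = 768, bytes_per_dim: int = 4,
                pq_factor: float = 10.0, node_ram_gb: float = 200.0,
                replication: int = 3) -> dict:
    raw_gb = n_chunks * dim * bytes_per_dim / 1e9
    pq_gb = raw_gb / pq_factor
    return {
        "raw_tb": raw_gb / 1000,                                      # ~3.1 TB
        "pq_gb": pq_gb,                                               # ~307 GB
        "pq_nodes_ha": math.ceil(pq_gb / node_ram_gb) * replication,  # ~6
    }

print(index_nodes(1e9))
```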

Problem 4. Budget for embedding all of Wikipedia.

  • English Wikipedia ≈ 6M articles × ~1000 tokens each ≈ 6B tokens.
  • At OpenAI text-embedding-3-small pricing ($0.02/M tokens): $120 total, one-shot.
  • The cost is so small that embedding everything is nearly free. The bottleneck is not money; it’s throughput.

115.11 The mental model

Eight points to take into Chapter 116:

  1. Memorize seventeen anchor numbers. Everything else derives from them.
  2. The latency hierarchy is two orders of magnitude per layer. L1 → RAM → SSD → HDD. HBM is ~150 ns; same-AZ network is ~500 μs; cross-region is ~100 ms.
  3. H100 = 3.35 TB/s HBM, ~989 TFLOP/s bf16, ~$2/hr reserved. H200 = same compute, ~43% more bandwidth. B200 = 2× H100 on both axes at ~2.5× the price.
  4. 70B bf16 on 2× H100 ≈ 1200 tokens/sec total. Small model ≈ 5000. Frontier ≈ 800.
  5. Self-hosted 70B ≈ $0.50/M tokens at good utilization. Hosted API ≈ $0.80. GPT-4o output ≈ $10. That's the cost ladder.
  6. The KV cache formula is 2 × n_layers × n_kv_heads × d_h × bytes per token. Memorize it.
  7. The four derivation chains (serving, KV cache, vector index, storage) cover 90% of estimation problems. Drill them until they take 60 seconds each.
  8. Always name assumptions out loud. The interviewer is grading not the answer but the chain from requirements to numbers.

Chapter 116 is the first worked design problem — serving a chatbot to 1M users — and uses almost every number in this chapter.


Read it yourself

  • Jeff Dean, “Numbers Every Programmer Should Know” (2012 Stanford talk). The ancestor.
  • The NVIDIA H100 and H200 data sheets on nvidia.com — the authoritative source for HBM capacity, bandwidth, and TFLOPs.
  • The vLLM benchmarks repository on GitHub — realistic throughput numbers for many model/GPU combinations.
  • The Anthropic and OpenAI pricing pages, updated whenever they change.
  • Martin Kleppmann, Designing Data-Intensive Applications, chapter 1 (latency hierarchy) and the throughput tables in chapter 11.
  • Peter Bailis et al., “Scalable Atomic Visibility with RAMP Transactions” — for the theoretical grounding on why distributed systems have the latency floors they do.

Practice

  1. Recite the seventeen anchor numbers from §115.1 from memory. Aim for under 60 seconds.
  2. Compute the KV cache for Llama 3 8B at 32k context, bf16, for 100 concurrent users. Does it fit on 1× H100?
  3. Estimate the monthly cost of serving Mistral 7B to 2M DAU, with 10 requests per day per user and 400 output tokens per request, at 80% utilization.
  4. A company is embedding 50M product descriptions averaging 200 tokens each. Estimate cost on OpenAI text-embedding-3-small vs self-hosted BGE-large on 1× A10.
  5. Work Chain 3 (vector index sizing) for 500M chunks of 768-dim vectors. What’s the RAM footprint, and how many nodes at 256 GB per node?
  6. A team is logging every request to Kafka at 50k QPS, 2 KB per record, with 30-day retention. Compute monthly storage and Kafka broker count assuming 300 MB/s per broker.
  7. Stretch: Pick a real cloud provider (AWS, GCP, or Azure) and look up the current on-demand price for H100, H200, and the largest-RAM compute instance. Compute the self-hosted cost of serving Llama 70B in bf16 at 1000 tokens/sec realistic throughput, on-demand. Compare to Together AI’s current hosted price for the same model.