Part X · ML System Design Interview Playbook
Chapter 115 · ~18 min read

Capacity planning math: the back-of-envelope kit

"A senior candidate doesn't compute the numbers. They recognize them."

Chapter 114 established the interview framework. The estimate phase — phase two of five — is the one that separates senior from mid-level most visibly. It is pure numerical fluency: can the candidate convert “1M users” into “300 H100s and $500k a month” in under two minutes, without a calculator, without hesitation? This chapter is the numerical kit that makes that possible.

The goal is not to memorize every number. The goal is to memorize the handful that everything else derives from, and to know the derivation paths well enough that the rest fall out. This chapter lists those anchor numbers, the derivation chains that use them, and the practice estimation problems that burn them into muscle memory. A candidate who drills this chapter for a week can walk into any senior ML systems interview and do capacity math faster than the interviewer can follow.

Outline:

  1. The anchor numbers you memorize.
  2. The latency table — L1 to inter-region.
  3. GPU specs that matter — H100, H200, B200.
  4. Bytes per op for each storage class.
  5. Tokens per second per GPU per model.
  6. Dollars per million tokens.
  7. Network throughput across the stack.
  8. Databases and IOPS.
  9. The derivation chains.
  10. Worked estimation problems.
  11. The mental model.

115.1 The anchor numbers you memorize

These are the seventeen numbers that every senior candidate has in working memory. Everything else in this chapter is derived from them.

| Quantity | Value |
| --- | --- |
| L1 cache hit | 1 ns |
| RAM access | 100 ns |
| SSD random read | 100 μs |
| HDD seek | 10 ms |
| Same-zone network round trip | 500 μs |
| Cross-region network round trip | 50–150 ms |
| H100 HBM bandwidth | 3.35 TB/s |
| H100 bf16 peak | ~989 TFLOP/s |
| H100 FP8 peak | ~2 PFLOP/s |
| H100 HBM capacity | 80 GB |
| H100 on-demand cloud price | $3–4/hr |
| H100 reserved cloud price | ~$2/hr |
| Llama 70B bf16 realistic decode throughput on 2× H100 (TP=2) | ~1200 tokens/s |
| Llama 8B bf16 realistic decode throughput on 1× H100 | ~5000 tokens/s |
| GPT-4o output price | ~$10 / M tokens |
| Self-hosted 70B cost | ~$0.50 / M tokens at good utilization |
| Seconds in a day | 86,400 |

Memorize these. They are the axes of every conversation. An interviewer who hears the candidate say “H100 is about 3 terabytes per second on HBM” relaxes, because it signals the candidate is operating at the right level.

A useful trick: seconds in a day is 86,400. Round to 100,000 for mental math. The ~16% error is irrelevant in an estimate phase where the answer is good to within 2×.

115.2 The latency table — L1 to inter-region

Jeff Dean’s “Numbers Every Programmer Should Know” (2012) is the canonical version. Updated for 2025 with NVMe, HBM, and cross-region reality:

| Operation | Latency | Relative to L1 |
| --- | --- | --- |
| L1 cache hit | 1 ns | 1× |
| Branch mispredict | 3 ns | 3× |
| L2 cache hit | 4 ns | 4× |
| L3 cache hit | 15 ns | 15× |
| Main memory (DRAM) | 100 ns | 100× |
| HBM access on H100 | ~150 ns | 150× |
| Compress 1 KB with zstd | 1 μs | 1000× |
| Send 1 KB over 10 Gbps | 1 μs | 1000× |
| SSD random 4 KB read | 100 μs | 100,000× |
| NVMe sequential 1 MB read | 200 μs | 200,000× |
| Same-AZ round trip | 500 μs | 500,000× |
| Same-region (different AZ) round trip | 1 ms | 1,000,000× |
| HDD seek | 10 ms | 10,000,000× |
| Inter-region round trip (same continent) | 30–80 ms | 30–80M× |
| Inter-region round trip (US–EU) | 80–100 ms | 80–100M× |
| Inter-region round trip (US–APAC) | 150–200 ms | 150–200M× |

Two patterns to internalize. First, RAM is 100× slower than L1, SSD is 1000× slower than RAM, HDD is 100× slower than SSD. Each layer is roughly two orders of magnitude worse.

[Figure: latency hierarchy — L1 cache 1 ns → RAM/HBM 100 ns → NVMe SSD 100 μs → same-AZ network 500 μs → HDD 10 ms → cross-region 100 ms.]
Each storage layer is ~100–1000× slower than the one above — knowing the crossover points tells you when a cache hit changes architecture decisions.
Second, **same-AZ network is comparable to SSD latency** — this is why distributed caches work, and why same-AZ microservices are not as expensive as naive intuition suggests.

For interview purposes, the two numbers to use constantly: 500 μs (same-AZ network), 100 ms (cross-region). When the interviewer asks “should this be in-process or cross-service?”, the answer depends on whether 500 μs × N fits the latency budget.

115.3 GPU specs that matter

The four GPUs a senior candidate has memorized for 2025–2026 interviews:

| GPU | HBM | Bandwidth | bf16 TFLOPs | FP8 TFLOPs | Launched |
| --- | --- | --- | --- | --- | --- |
| A100 80GB | 80 GB HBM2e | 2.0 TB/s | 312 | — | 2020 |
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | 989 | 1979 | 2022 |
| H200 SXM | 141 GB HBM3e | 4.8 TB/s | 989 | 1979 | 2024 |
| B200 | 192 GB HBM3e | 8.0 TB/s | 2250 | 4500 | 2024–2025 |

The key insight interviewers probe for: H100 and H200 have the same compute. H200 is an H100 with more HBM capacity and more HBM bandwidth. For memory-bandwidth-bound workloads like decode, H200 is ~43% faster than H100 because bandwidth is 4.8/3.35 ≈ 1.43×. For compute-bound workloads like prefill or training, they are identical.

B200 roughly doubles both compute and bandwidth over H100 (8 TB/s vs 3.35 TB/s, 2250 vs 989 TFLOP/s). It is also ~2.5× more expensive. The dollars-per-token math almost always works out to a wash at current pricing — B200 wins on latency, not on cost efficiency (see Chapter 30).

For AMD MI300X: 192 GB HBM3, 5.3 TB/s bandwidth, similar bf16 TFLOPs to H100. Important to name because it is the main non-NVIDIA option, but as of 2026 software maturity lags. A senior candidate mentions MI300X as a tradeoff point, not a default.

The sanity check you should do in your head: the roofline of any dense LLM inference workload is HBM_bandwidth / model_size. For Llama 70B bf16 (140 GB) on H100 (3.35 TB/s), that’s 24 weight reads per second at batch=1, meaning 24 tokens per second per request at the memory limit. Batch of 64 pushes total throughput to ~1500 tokens/sec. The “realistic 1000 tokens/sec” number is this formula minus some efficiency loss.
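The same roofline arithmetic in a few lines — a sketch using the chapter's anchors (3.35 TB/s, 140 GB) with no efficiency fudge, so the batch-64 figure lands slightly above the "realistic" number:

```python
# Decode roofline: each generated token must stream every weight from HBM,
# so tokens/sec per request ≈ HBM bandwidth / model size; batching multiplies
# total throughput until compute or KV-cache capacity intervenes.

def decode_roofline(hbm_bw_tb_s: float, model_gb: float, batch: int = 1) -> float:
    """Upper-bound decode tokens/sec for a dense model (no efficiency loss)."""
    weight_reads_per_s = (hbm_bw_tb_s * 1e12) / (model_gb * 1e9)
    return weight_reads_per_s * batch

print(round(decode_roofline(3.35, 140)))            # Llama 70B bf16, batch=1: 24
print(round(decode_roofline(3.35, 140, batch=64)))  # batch=64: 1531 (~1500)
```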

115.4 Bytes per op for each storage class

Storage classes ranked by bytes-per-dollar and bytes-per-second:

| Class | Cost/GB/month | Read bandwidth (single device) | Typical size |
| --- | --- | --- | --- |
| HBM | — (on-GPU) | 3.35 TB/s (H100) | 80–192 GB |
| DRAM | ~$4 | 100 GB/s | 1 TB (server) |
| NVMe SSD | ~$0.10 | 7 GB/s | 30 TB (dense server) |
| SATA SSD | ~$0.05 | 500 MB/s | 4 TB |
| HDD | ~$0.02 | 200 MB/s | 20 TB |
| Object storage (S3) | ~$0.023 | 100 MB/s per connection, ~GB/s parallel | unbounded |
| Glacier | ~$0.004 | minutes to first byte | unbounded |

The two rules that follow:

HBM is ~30× faster than DRAM and ~500× faster than NVMe. When a KV cache gets evicted from HBM to DRAM, the fetch cost is tens of microseconds per block. When it gets evicted to NVMe, it’s tens of milliseconds. This is why KV cache offloading (Chapter 37) is a latency compromise, not a free win.

Object storage is cheap but slow per-connection. Pulling a 140 GB model weight file from S3 over a single connection takes ~1400 seconds at 100 MB/s. Pulling it in parallel with 64 connections takes ~20 seconds. The cold-start problem (Chapter 52) is entirely about which layer of this hierarchy the weights live on when a pod starts.
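The transfer math as a sketch, assuming ~100 MB/s per connection and near-linear scaling across parallel connections (real object-store scaling is not perfectly linear):

```python
# Time to pull model weights from object storage over N parallel connections.

def pull_seconds(size_gb: float, connections: int,
                 mb_s_per_connection: float = 100.0) -> float:
    return (size_gb * 1000) / (connections * mb_s_per_connection)

print(round(pull_seconds(140, 1)))   # single connection: 1400 s (~23 min)
print(round(pull_seconds(140, 64)))  # 64 connections: ~22 s
```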

For capacity planning: if the question is “how much data can we keep hot?”, the answer is bounded by DRAM cost. If it’s “how much can we keep warm?”, the answer is NVMe. If it’s “how much can we keep cold?”, the answer is S3 and the budget is almost unlimited.

115.5 Tokens per second per GPU per model

The serving-throughput table for bf16, realistic (not peak), assuming vLLM with continuous batching at healthy batch sizes:

| Model | GPU | Total tokens/s (realistic) | Batch size used |
| --- | --- | --- | --- |
| Llama 3 8B | 1× H100 | ~5000 | 64–128 |
| Llama 3 8B | 1× A100 80GB | ~3000 | 64 |
| Llama 3 70B | 2× H100 TP=2 | ~1200 total | 64 |
| Llama 3 70B | 1× H200 | ~1500 | 64 |
| Llama 3 405B | 8× H100 TP=8 | ~800 | 32 |
| Mixtral 8x7B | 2× H100 TP=2 | ~2000 | 64 |
| Qwen 2.5 72B | 2× H100 TP=2 | ~1200 | 64 |

With FP8 or INT4 quantization, multiply by 1.5–2.5×. With speculative decoding and good draft models, multiply by another 1.3–1.8×. With disaggregated prefill/decode on vision-heavy workloads (Chapter 36), add 30–50% throughput on the shared fleet.

The interview-relevant shortcuts:

  • 8B on 1× H100 ≈ 5k tokens/s. A small model “fills” an H100 at ~5000 tokens/sec.
  • 70B on 1× H100 (FP8) ≈ 1500 tokens/s. A medium model on one H100 with FP8.
  • 70B on 2× H100 bf16 ≈ 1200 tokens/s. The tensor-parallel version if you need bf16.
  • 400B on 8× H100 ≈ 800 tokens/s. The frontier-ish open model, needs 8 GPUs, hurts.

The last of these is the one candidates get wrong most often: frontier-scale models are slow per GPU because inter-GPU parallelism overhead eats much of the gains. Don't pitch 70B-style throughput for a 405B.

115.6 Dollars per million tokens

The dollar numbers interviewers expect candidates to cite confidently:

API prices (late 2025 / 2026, rough):

| Model | Input $/M | Output $/M |
| --- | --- | --- |
| GPT-4o | $2.50 | $10 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude Sonnet (latest) | $3 | $15 |
| Claude Haiku | $0.25 | $1.25 |
| Llama 70B (Together, Fireworks, DeepInfra) | $0.60–$0.90 | $0.60–$0.90 |
| Llama 8B hosted | $0.05–$0.20 | $0.05–$0.20 |
| Embedding (OpenAI text-embedding-3-small) | $0.02 | — |
| Embedding (OpenAI text-embedding-3-large) | $0.13 | — |
| Reranker (Cohere rerank) | ~$2 per 1k searches | — |

Self-hosted cost (output tokens, at good utilization):

| Config | $/M tokens |
| --- | --- |
| 8B bf16 on H100, 80% util | ~$0.11 |
| 70B bf16 on H100×2, 80% util | ~$0.58 |
| 70B FP8 on H100, 80% util | ~$0.25 |
| 70B INT4 on H100, 80% util | ~$0.18 |
| 405B bf16 on H100×8, 80% util | ~$2.00 |

The derived rules:

  • Self-hosted 70B is 20× cheaper than GPT-4o output per token.
  • Hosted 70B on Together/Fireworks is ~$0.80 per million. That’s the benchmark for “should I self-host or not?” Below ~100M tokens/month, use hosted. Above ~500M tokens/month, self-host.
  • Input tokens are 2–4× cheaper than output tokens on every API. This reflects prefill (compute-bound) being cheaper than decode (bandwidth-bound) per token, plus aggregate provider economics.
  • Embeddings are basically free. At $0.02 per million tokens, you can afford to embed everything. Tens of millions of documents at ~500 tokens each costs pennies.
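The self-hosted figures all come from one formula: GPU-dollars in, tokens out. A sketch using the chapter's $2/hr reserved anchor — which throughput and utilization you plug in are exactly the assumptions to state aloud:

```python
# Self-hosted cost per million output tokens.

def dollars_per_m_tokens(gpu_count: int, dollars_per_gpu_hr: float,
                         tokens_per_s: float, utilization: float) -> float:
    effective_tokens_per_hr = tokens_per_s * utilization * 3600
    return (gpu_count * dollars_per_gpu_hr) / effective_tokens_per_hr * 1e6

# 70B FP8 on 1× H100 at 1500 tok/s and 80% utilization:
print(round(dollars_per_m_tokens(1, 2.0, 1500, 0.8), 2))  # ~$0.46/M
```

With the more conservative throughputs of §115.5 this lands near the ~$0.50 anchor; more aggressive batching assumptions push it toward the table's figures.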

115.7 Network throughput across the stack

The throughput numbers across a typical data center stack:

| Link | Bandwidth |
| --- | --- |
| PCIe Gen4 x16 | 32 GB/s |
| PCIe Gen5 x16 | 64 GB/s |
| NVLink per GPU pair (H100) | 900 GB/s bidirectional |
| NVSwitch (H100 system, full 8-way) | 900 GB/s per GPU |
| 100 GbE NIC | 12.5 GB/s |
| 200 GbE NIC | 25 GB/s |
| 400 GbE NIC | 50 GB/s |
| InfiniBand HDR (200 Gbps) | 25 GB/s |
| InfiniBand NDR (400 Gbps) | 50 GB/s |
| Same-region inter-AZ bandwidth (cloud) | 10 Gbps typical |
| Cross-region bandwidth (cloud) | 1–10 Gbps typical |

The rules that matter:

  • NVLink is ~30× faster than PCIe Gen4 and ~70× faster than 100 GbE. Tensor parallel across NVLink is viable; tensor parallel across Ethernet is not.
  • Ethernet inter-node is the bottleneck for multi-node training. This is why InfiniBand dominates training clusters.
  • Same-region cross-AZ at 10 Gbps is ~1.25 GB/s. Pulling a 140 GB model across AZs takes ~2 minutes. This is why cold starts from S3 across AZs are painful.

For inference, the numbers most often used: 900 GB/s NVLink inside a node, 25 GB/s per NIC between nodes, 500 μs inter-node round trip at the application layer.

115.8 Databases and IOPS

The reference numbers for storage systems behind an ML platform:

| System | Read QPS (per node) | Write QPS | p99 latency |
| --- | --- | --- | --- |
| Redis (in-memory) | 100k+ | 100k+ | <1 ms |
| Memcached | 100k+ | 100k+ | <1 ms |
| PostgreSQL (point query on SSD) | 10k–50k | 5k–20k | 1–10 ms |
| MySQL (point query on SSD) | 10k–50k | 5k–20k | 1–10 ms |
| MongoDB | 10k–30k | 5k–15k | 1–10 ms |
| DynamoDB (per partition) | 3000 reads | 1000 writes | <10 ms |
| Cassandra | 10k–20k | 5k–10k | 1–50 ms |
| Elasticsearch (search) | 1k–5k | 1k–5k | 10–100 ms |
| Kafka (per broker) | 1M+ messages/s at small payload | — | 1–10 ms |
| HNSW vector index (FAISS) | 1k–10k searches/s per core | — | 1–10 ms |
| IVF vector index | 10k–100k/s per core | — | 1–5 ms |

Key interview facts:

  • Redis can do 100k QPS. That is the ceiling you assume for any cache layer.
  • Postgres can do 10k QPS for point queries. Range scans and joins are much slower; the 10k number is for primary-key lookups.
  • DynamoDB partitions cap at ~3000 reads/s. If your hot key exceeds this, you need to shard manually or pre-fan-out.
  • Kafka can absorb millions of messages per second per broker. That is why it is the backbone of every telemetry/metering pipeline.
  • A single-core HNSW index does 1–10k vector searches per second. Scaling is horizontal (sharding), not vertical.

115.9 The derivation chains

The four derivation chains every senior candidate can run cold, in under 90 seconds each.

Chain 1 — LLM serving from users to GPUs.

DAU → sessions/user/day → requests/session → tokens/request
  → total tokens/day → tokens/sec avg → tokens/sec peak (×3-5)
  → GPUs at peak (tokens/sec / per-GPU-throughput)
  → GPUs with headroom (×1.3)
  → $/month (GPUs × $2/hr × 24 × 30)

Chain 2 — KV cache memory budget.

n_layers × n_kv_heads × d_h × 2 (for K and V) × bytes_per_elem
  = per-token KV size
  × context_length_per_user = per-user cache size
  × concurrent_users = total cache requirement

Chain 3 — Vector index sizing.

num_documents × chunks_per_doc × tokens_per_chunk → total chunks
  × embedding_dim × 4 bytes (fp32) = raw vector bytes
  × 1.2 (HNSW overhead) = index size
  / per-node RAM budget = nodes required

Chain 4 — Storage cost for logs and traces.

QPS × bytes_per_record × 86400 s/day = bytes/day
  × retention_days = bytes at rest
  × $0.023/GB/month (S3) or $4/GB/month (DRAM cache)
  = $/month for storage
  + egress cost (rarely matters)

Master these four and nearly every estimation phase in every ML systems interview is a matter of plugging the anchor numbers from §115.1–§115.8 into one or two of them.

```mermaid
graph LR
  DAU["DAU"] --> SESS["× sessions/user/day"]
  SESS --> REQ["× requests/session"]
  REQ --> TOK["× tokens/request"]
  TOK --> TPDAY["tokens / day"]
  TPDAY --> TPSEC["÷ 86,400 → avg tok/sec"]
  TPSEC --> PEAK["× 3–5 → peak tok/sec"]
  PEAK --> GPUS["÷ per-GPU throughput → GPU count"]
  GPUS --> HEAD["× 1.3 headroom"]
  HEAD --> COST["× $/GPU-hr × 720 → $/month"]
  style PEAK fill:var(--fig-accent-soft),stroke:var(--fig-accent)
```

Chain 1 (LLM serving): the peak tokens-per-second node is the pivot — everything before it is user math, everything after is infrastructure math.
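Chain 1 as a function — a sketch with the chapter's defaults baked in (×4 peak factor, ×1.3 headroom, $2/GPU-hr reserved, 720 hours/month):

```python
# Chain 1: users → tokens/day → peak tokens/sec → GPUs → dollars/month.

def chain1(dau: float, sessions_per_day: float, requests_per_session: float,
           tokens_per_request: float, tok_s_per_gpu: float,
           peak_factor: float = 4.0, headroom: float = 1.3,
           dollars_per_gpu_hr: float = 2.0) -> tuple:
    tokens_per_day = dau * sessions_per_day * requests_per_session * tokens_per_request
    peak_tok_s = tokens_per_day / 86_400 * peak_factor
    gpus = peak_tok_s / tok_s_per_gpu * headroom
    monthly_dollars = gpus * dollars_per_gpu_hr * 720
    return round(gpus), round(monthly_dollars)

# Problem 1 in §115.10: 500k DAU, 5 turns/day, 600 tokens/turn, 500 tok/s/GPU.
print(chain1(500_000, 5, 1, 600, 500))  # (181, 260000)
```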

115.10 Worked estimation problems

Problem 1. Serve Llama 3 70B to 500k DAU, each doing 5 chat turns per day with ~600 tokens of output per turn. What’s the GPU count and monthly cost?

Walking it:

  • 500k × 5 × 600 = 1.5B tokens/day output.
  • 1.5B / 86,400 ≈ 17,400 tokens/sec average.
  • Peak (×4) ≈ 70,000 tokens/sec.
  • Per H100 (70B bf16 on 2× H100 TP=2): ~1200 tokens/sec per pair, so ~600 per GPU effective. Round to “500 tokens/sec per GPU” for safety.
  • 70,000 / 500 = 140 GPUs at peak.
  • With 30% headroom: ~180 GPUs.
  • 180 × $2/hr × 720 hr/month = ~$260k/month.
  • In $/M-token terms: $260k / (1.5B tokens/day × 30 days) ≈ $5.80/M tokens (high because we sized for peak, not average).

Problem 2. How much HBM does the KV cache consume for 200 concurrent Llama 70B requests, each at 8k context, in bf16?

  • Per-token cache: 2 × 80 layers × 8 kv-heads × 128 d_h × 2 bytes ≈ 320 KB.
  • Per request: 320 KB × 8192 ≈ 2.6 GB.
  • 200 × 2.6 = 520 GB.

This is across all tensor-parallel shards. On 4× H100, that’s 130 GB per GPU — impossible. On 8× H200 (141 GB each, 1128 GB total), it fits. The interviewer’s follow-up: “what would you do instead?” Answer: INT8 KV cache (halves it to 260 GB), GQA reduction (already at GQA 8), prefix caching (if system prompts are shared), shorter contexts.
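Chain 2, checked in code with the same Llama 70B shape used above (80 layers, 8 KV heads under GQA, head dim 128, bf16). The text's 520 GB comes from rounding the per-token size down to 320 KB before multiplying:

```python
# Chain 2: per-token KV bytes → per-request size → total across concurrency.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                       bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem  # ×2 for K and V

per_token = kv_bytes_per_token(80, 8, 128)   # 327,680 bytes ≈ 320 KB
per_request_gb = per_token * 8192 / 1e9      # ≈ 2.7 GB at 8k context
total_gb = per_request_gb * 200              # ≈ 537 GB for 200 requests
print(per_token, round(per_request_gb, 1), round(total_gb))  # 327680 2.7 537
```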

Problem 3. RAG index over 10 TB of text. How big is the vector index, and how many nodes to serve it from RAM?

  • 10 TB of raw text at ~4 bytes/token (~4 chars/token, ~1 byte/char) ≈ 2.5 trillion tokens. Far too many; "10 TB" must mean documents, not clean text. Reinterpret: 10 TB of documents at ~500 bytes per chunk → 20 billion chunks.
  • Reality check: that's still absurdly high. Re-interpret again: 10 TB of documents at 1 MB per doc → 10M docs → 100 chunks each → 1B chunks.
  • 1B chunks × 768 dim × 4 bytes = 3 TB raw vectors.
  • With PQ compression (×10 reduction): 300 GB.
  • Per-node RAM budget 200 GB → ~15 nodes for the raw vectors, ~2 nodes for the PQ-compressed index.
  • With replication ×3 for HA on the PQ index: ~6 nodes.

(The key lesson is that “10 TB” is ambiguous and the candidate should ask what the unit is. Chapter 117 makes this explicit.)
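Chain 3 for the 1B-chunk interpretation, as a sketch (768-dim fp32 vectors, ×10 PQ compression, 200 GB RAM per node, and ×3 replication are the assumptions from the walk above):

```python
import math

# Chain 3: chunks → raw vector bytes → PQ-compressed size → node count.

def index_nodes(n_chunks: float, dim: int = 768, bytes_per_dim: int = 4,
                pq_factor: float = 10.0, node_ram_gb: float = 200.0,
                replication: int = 3) -> dict:
    raw_gb = n_chunks * dim * bytes_per_dim / 1e9
    pq_gb = raw_gb / pq_factor
    return {
        "raw_tb": raw_gb / 1000,                                      # ~3.1 TB
        "pq_gb": pq_gb,                                               # ~307 GB
        "pq_nodes_ha": math.ceil(pq_gb / node_ram_gb) * replication,  # ~6
    }

print(index_nodes(1e9))
```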

Problem 4. Budget for embedding all of Wikipedia.

  • English Wikipedia ≈ 6M articles × ~1000 tokens each ≈ 6B tokens.
  • At OpenAI text-embedding-3-small pricing ($0.02/M tokens): $120 total, one-shot.
  • The cost is so small that embedding everything is nearly free. The bottleneck is not money; it’s throughput.

115.11 The mental model

Eight points to take into Chapter 116:

  1. Memorize seventeen anchor numbers. Everything else derives from them.
  2. The latency hierarchy is two orders of magnitude per layer. L1 → RAM → SSD → HDD. HBM is ~150 ns; same-AZ network is ~500 μs; cross-region is ~100 ms.
  3. H100 = 3.35 TB/s HBM, ~989 TFLOP/s bf16, ~$2/hr reserved. H200 = same compute, ~43% more bandwidth. B200 = 2× H100 on both axes at ~2.5× the price.
  4. 70B bf16 on 2× H100 ≈ 1200 tokens/sec total. Small model ≈ 5000. Frontier ≈ 800.
  5. Self-hosted 70B ≈ $0.50/M tokens at good utilization. Hosted API ≈ $0.80. GPT-4o output ≈ $10. That's the cost ladder.
  6. The KV cache formula is 2 × n_layers × n_kv_heads × d_h × bytes per token. Memorize it.
  7. The four derivation chains (serving, KV cache, vector index, storage) cover 90% of estimation problems. Drill them until they take 60 seconds each.
  8. Always name assumptions out loud. The interviewer is grading not the answer but the chain from requirements to numbers.

Chapter 116 is the first worked design problem — serving a chatbot to 1M users — and uses almost every number in this chapter.


Read it yourself

  • Jeff Dean, “Numbers Every Programmer Should Know” (2012 Stanford talk). The ancestor.
  • The NVIDIA H100 and H200 data sheets on nvidia.com — the authoritative source for HBM capacity, bandwidth, and TFLOPs.
  • The vLLM benchmarks repository on GitHub — realistic throughput numbers for many model/GPU combinations.
  • The Anthropic and OpenAI pricing pages, updated whenever they change.
  • Martin Kleppmann, Designing Data-Intensive Applications, chapter 1 (latency hierarchy) and the throughput tables in chapter 11.
  • Peter Bailis et al., “Scalable Atomic Visibility with RAMP Transactions” — for the theoretical grounding on why distributed systems have the latency floors they do.

Practice

  1. Recite the seventeen anchor numbers from §115.1 from memory. Aim for under 60 seconds.
  2. Compute the KV cache for Llama 3 8B at 32k context, bf16, for 100 concurrent users. Does it fit on 1× H100?
  3. Estimate the monthly cost of serving Mistral 7B to 2M DAU, with 10 requests per day per user and 400 output tokens per request, at 80% utilization.
  4. A company is embedding 50M product descriptions averaging 200 tokens each. Estimate cost on OpenAI text-embedding-3-small vs self-hosted BGE-large on 1× A10.
  5. Work Chain 3 (vector index sizing) for 500M chunks of 768-dim vectors. What’s the RAM footprint, and how many nodes at 256 GB per node?
  6. A team is logging every request to Kafka at 50k QPS, 2 KB per record, with 30-day retention. Compute monthly storage and Kafka broker count assuming 300 MB/s per broker.
  7. Stretch: Pick a real cloud provider (AWS, GCP, or Azure) and look up the current on-demand price for H100, H200, and the largest-RAM compute instance. Compute the self-hosted cost of serving Llama 70B in bf16 at 1000 tokens/sec realistic throughput, on-demand. Compare to Together AI’s current hosted price for the same model.