Part X · ML System Design Interview Playbook
Chapter 122 ~17 min read

The vocabulary that interviewers respect

"Using the right word is never pedantry. It's a handshake: 'I've been where you've been, and I know what this thing is called.'"

Senior ML systems interviews are conducted in a specialized vocabulary. Each phrase is a compressed reference to a body of knowledge — often a whole chapter from a book like this one. When a candidate uses the phrase correctly, the interviewer understands that the candidate has internalized the underlying concept. When the candidate explains the concept but avoids the phrase, the interviewer assumes the candidate half-remembers it.

This chapter is a reference: fifty phrases that signal senior systems thinking, each with a one-line definition, an example of how to use it in context, and the interview signal it sends. Use it as a vocabulary trainer. Read through it once, find the three or four phrases you weren’t using correctly, and integrate them into your next mock interview.

graph TD
  REL["Reliability\nat-least-once · idempotency key\nfail-closed/open · blast radius\ndead letter queue · graceful degradation"]
  LAT["Latency / Throughput\ntail latency · Little's Law\nhead-of-line blocking\nwarm path vs cold path"]
  CAP["Capacity / Scaling\ncell-based architecture\nbulkhead pattern\nnoisy neighbor\ncapacity headroom"]
  DATA["Data / Consistency\nread-your-writes\ntrain/serve consistency\nappend-only log\ncontent-addressed storage"]
  FAIL["Failure / Recovery\nRTO / RPO · error budget\nblameless postmortem\nrunbook toil"]
  ML["ML-Specific\nKV cache pressure · prefix hit rate\ncontinuous batching · PD disaggregation\nMFU / MBU · LLM-as-judge"]
  VOCAB["50 Senior Phrases"]
  VOCAB --> REL
  VOCAB --> LAT
  VOCAB --> CAP
  VOCAB --> DATA
  VOCAB --> FAIL
  VOCAB --> ML

Each domain cluster maps to a different part of the system design space — an interviewer who hears a candidate switch fluently across all six clusters infers senior breadth.

Outline:

  1. Reliability and request lifecycle vocabulary.
  2. Latency and throughput vocabulary.
  3. Capacity and scaling vocabulary.
  4. Data and consistency vocabulary.
  5. Failure and recovery vocabulary.
  6. ML-specific systems vocabulary.
  7. Benchmarking and evaluation vocabulary.
  8. Ops and deployment vocabulary.
  9. The full list as a flashcard.
  10. The mental model.

122.1 Reliability and request lifecycle vocabulary

1. “At-least-once with idempotent consumers.” The standard delivery guarantee for most streaming systems. Messages may be delivered more than once, so consumers must process them idempotently. Use it when discussing Kafka consumers, metering events, webhook receivers. “The metering pipeline is at-least-once with idempotent consumers — we dedupe on request_id before incrementing the counter.” Signal: the candidate knows exactly-once is mostly a lie.
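The dedupe-before-increment move can be sketched in a few lines. This is an illustrative shape, not any particular library's API; in production the `seen` set would live in Redis or the database with a TTL, and `record_usage` stands in for whatever side effect the consumer performs:

```python
def make_idempotent_consumer(handler):
    """Wrap a handler so redelivered messages are processed at most once.

    An in-memory set is enough to show the shape; a real system would
    back this with Redis or a database table keyed on request_id.
    """
    seen = set()

    def consume(message):
        request_id = message["request_id"]
        if request_id in seen:   # duplicate delivery: drop silently
            return False
        seen.add(request_id)
        handler(message)         # side effect runs once per request_id
        return True

    return consume

counter = {"tokens": 0}

def record_usage(message):
    counter["tokens"] += message["tokens"]

consume = make_idempotent_consumer(record_usage)
consume({"request_id": "r1", "tokens": 100})
consume({"request_id": "r1", "tokens": 100})  # redelivery: no double count
```

The point of the sketch is where the dedupe happens: before the side effect, keyed on a client-visible ID, so at-least-once delivery upstream becomes effectively-once processing downstream.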

2. “Exactly-once is a lie unless you’re willing to pay for it.” A stronger statement of the same point. Used to push back when someone claims exactly-once semantics. “Kafka transactions give you exactly-once within the cluster, but end-to-end requires idempotent writes on the sink, so in practice we run at-least-once with idempotent consumers.”

3. “Back-pressure, not failure.” The principle that under overload, a system should slow down, not crash. “I’d rather the gateway apply back-pressure than let requests hit the GPUs and OOM — shed load at the front door.” Signal: the candidate has read Reactive Streams and understands Little’s Law.

4. “Shed load before queueing.” Same principle, different angle. If the queue grows unbounded, latency degrades for everyone. Reject early. “When the serving fleet’s KV cache hits 90%, I start shedding free-tier traffic rather than queueing it — queueing just means longer tail latency for everyone.”

5. “Fail closed vs fail open.” When a dependency is unavailable, the system either denies the request (fail closed, safe) or allows it with reduced functionality (fail open, available). “The safety service fails open for the chatbot but fails closed for CSAM detection — the cost asymmetry demands it.” Signal: the candidate understands tradeoffs have policy consequences.

6. “Blast radius.” The set of users or systems affected by a failure. Used for designing isolation boundaries. “Sharding tenants by region reduces blast radius — a bad deploy in us-east doesn’t take down eu-west.” Signal: the candidate thinks about cell-based architecture.

7. “Idempotency key.” A client-supplied unique ID that allows safe retries. Standard practice for POST endpoints. “The job submission API takes an Idempotency-Key header, so if the client retries on a timeout, they get the same job_id back instead of two training runs.” Signal: the candidate has actually built retry-safe APIs.

8. “Retry with exponential backoff and jitter.” The standard retry pattern: exponentially growing delays with random jitter to avoid thundering herds. “On rate-limit 429, retry with exponential backoff plus jitter — naive retry loops synchronize and create thundering herds.”
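A minimal sketch of the pattern, using full jitter (sleep a uniform random time up to the exponential cap, so retrying clients desynchronize); the parameter names and defaults are illustrative:

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base=0.5, cap=30.0):
    """Retry `call` with full-jitter exponential backoff.

    On each failure, sleep a uniform random time in
    [0, min(cap, base * 2**attempt)] before retrying; re-raise
    after the final attempt.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

The jitter is the part juniors omit: without it, every client that failed at the same moment retries at the same moment, which is exactly the thundering herd the entry warns about.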

9. “Dead letter queue.” The queue that captures messages that can’t be processed, for later investigation. “Metering events that fail schema validation go to a DLQ with a tag for manual review — we don’t drop them, we don’t retry forever either.”

10. “Graceful degradation.” A system that continues operating with reduced quality rather than outright failing. “When the reranker is down, we fall back to the first-pass retriever’s ordering — graceful degradation instead of an outright 503.”

122.2 Latency and throughput vocabulary

11. “Tail latency vs p50.” The insight that averages lie and the long tail is what users experience. “The p50 TTFT is 150 ms, but the p99 is 800 ms — users hit that tail often enough that it’s the real quality signal, not p50.” Signal: the candidate has read “The Tail at Scale” (Dean and Barroso 2013).

12. “Open-loop vs closed-loop benchmark.” Open-loop: requests arrive at a fixed rate regardless of system response. Closed-loop: the next request waits for the previous to complete. Closed-loop benchmarks under-report saturation. “The vLLM benchmark is closed-loop by default, which hides the system’s true tail behavior — use an open-loop generator for accurate capacity planning.”

13. “Saturate the bottleneck.” The principle that throughput is limited by the slowest stage, and optimizing elsewhere doesn’t help. “There’s no point adding prefill optimization if decode is memory-bandwidth bound — saturate the bottleneck first.”

14. “Amortize the fixed cost.” Distribute a fixed cost over many operations to lower the per-op cost. Common use: batching amortizes the per-step overhead across many requests. “Continuous batching amortizes the weight-read fixed cost across every request in the batch — that’s where the throughput comes from.”

15. “Cache locality.” The insight that fetching data close to where it’s used is much faster than fetching from far away. “Pre-pulling the model weights to node-local NVMe gives cache locality for cold starts — fetching from cross-AZ S3 is 100× slower.”

16. “Warm path vs cold path.” The warm path is the steady-state fast path; the cold path is the slow path triggered by cache misses or cold starts. “The prefix-cache-hit warm path is 200 ms; the cold path with full prefill is 800 ms — we design the workload to maximize warm-path hits.”

17. “Asymptotic vs amortized analysis.” Asymptotic big-O bounds the cost of a single operation in the worst case; amortized analysis averages the cost over a long sequence of operations. “Inserting into an HNSW graph is O(log n) amortized, but individual insertions can be much slower when the graph restructures — amortized analysis matches what production sees.”

18. “Head-of-line blocking.” When one slow request delays all the requests behind it. “Long-context requests cause head-of-line blocking in a shared batch — we route them to a dedicated pool to protect short-context tail latency.”

19. “Little’s Law.” L = λW: the average number of items in a system equals the arrival rate times the average time each item spends in the system. Used for queueing analysis. “At 200 QPS with 500 ms p50 latency, Little’s Law says we have ~100 requests in-flight on average — that’s the working set size.”

Little's Law: L = λ × W. At λ = 200 QPS (arrival rate) and W = 500 ms (average time in system), L = 100 requests are in-flight at any moment. Interview uses: sizing replica count, bounding queue depth, justifying the KV cache budget.
Little's Law converts an abstract QPS × latency pair into a concrete in-flight count — that count is what determines KV cache requirements and minimum replica sizing.
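The same arithmetic as a runnable sketch; the per-replica concurrency of 8 is an assumed number for illustration, not a property of any particular serving stack:

```python
import math

def inflight(qps, latency_s):
    """Little's Law: L = lambda * W (arrival rate times time in system)."""
    return qps * latency_s

# The chapter's example: 200 QPS at 500 ms average latency.
working_set = inflight(200, 0.5)          # 100 requests in-flight

# Sizing: if each replica's KV cache holds 8 concurrent requests
# (assumed), the floor is ceil(100 / 8) = 13 replicas before headroom.
replicas = math.ceil(working_set / 8)
```

This is the whole interview move: turn a QPS-and-latency pair into an in-flight count, then divide by per-replica concurrency to get a defensible minimum fleet size.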

20. “P99 of the dependency is your p99 floor.” You cannot be faster than your slowest dependency’s tail. “Our p99 SLO can’t be below the LLM’s p99 decode time — that’s the floor. If we need tighter, we need to move the LLM off the critical path.”

122.3 Capacity and scaling vocabulary

21. “Provisioned, reserved, on-demand, spot.” The four pricing models for cloud compute. Reserved is for predictable load; spot is for fault-tolerant batch; on-demand is the expensive default. “For the serving fleet, I’d use reserved for base capacity and on-demand for peak; spot is too unreliable for latency-critical workloads.”

22. “Elastic horizontal scaling.” Scaling out (more replicas) automatically as load changes. Opposed to vertical scaling (bigger machines). “LLM serving is elastically horizontal — we scale on num_requests_running up to 200 replicas per region.”

23. “Scale-to-zero.” The ability to drop to zero replicas when idle, saving cost. Valuable for bursty workloads but painful for cold starts. “For low-traffic tenants, scale-to-zero with a 10-second cold start via pre-cached weights on node NVMe — the math works out.”

24. “Cell-based architecture.” Partitioning infrastructure into independent cells (by region, by tenant shard) to limit blast radius. “We run one cell per region, each self-contained with its own gateway, control plane, and data plane — a bad deploy in one cell can’t take down another.”

25. “Bulkhead pattern.” Isolating resource pools so that one overloaded service can’t starve another. “We bulkhead the embedding pool from the reranker pool — if the reranker is slow, we don’t want embeddings to back up too.”

26. “Capacity headroom.” The fraction of peak capacity held in reserve for spikes and rolling deployments. “We run at 70% average utilization in production, leaving 30% headroom for traffic spikes and rolling upgrades — the cost is real but so is the protection.”

27. “Noisy neighbor.” A tenant in a shared pool whose high usage degrades other tenants’ performance. “We detect noisy neighbors by comparing per-tenant TTFT to pool-average — a 2σ deviation triggers a soft isolation move.”

122.4 Data and consistency vocabulary

28. “Read-your-writes guarantee.” Once a client has written, subsequent reads by that client see the write. Weaker than linearizability but stronger than eventual consistency. “For user profiles, we need read-your-writes — after a user updates their preferences, their next request must see the update.”

29. “Eventual consistency.” Given no new writes, all replicas will converge. No guarantee of when. “The metrics aggregation is eventually consistent — a per-tenant counter may lag by 30 seconds, which is fine for billing but not for quota enforcement.”

30. “Strong consistency / linearizability.” Every read sees the latest write. Expensive; usually requires a single leader. “The quota service is strongly consistent — we’d rather take a 5 ms latency hit than let a tenant burst past their cap because a replica was stale.”

31. “Write-through, write-back, write-around cache.” Three cache policies. Write-through writes both cache and backing store (safe, slow). Write-back writes cache first, flushes later (fast, risky on crash). Write-around skips cache for writes. “The session state uses write-back with async persistence to Postgres — we accept losing the last few seconds of state on a crash.”

32. “Content-addressed storage.” Data keyed by its hash, not a name. Deduplicates automatically. “The KV cache for shared prefixes is content-addressed — any request with the same 2k-token system prompt hits the same cache block.”
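A toy sketch of the idea; the dict stands in for a real blob store, and deduplication falls out of the keying scheme rather than any explicit check:

```python
import hashlib

def content_address(data: bytes) -> str:
    """Key a blob by the SHA-256 of its bytes: identical content gets an
    identical key, so duplicates collapse into one stored object."""
    return hashlib.sha256(data).hexdigest()

store = {}

def put(data: bytes) -> str:
    key = content_address(data)
    store[key] = data   # writing the same content twice is a no-op
    return key

prompt = b"You are a helpful assistant..."
k1 = put(prompt)
k2 = put(prompt)        # same bytes, same address: no second copy
```

The prefix-cache application in the entry is the same trick one level up: hash the token prefix, and every request carrying the same system prompt resolves to the same cache blocks.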

33. “Append-only log.” A data structure that only grows, never mutates in place. Kafka, Bitcask, LSM trees are all based on this. “The audit log is append-only to object storage — immutable by construction, easy to back up, hard to tamper with.”

34. “Schema evolution.” Changing the structure of stored data over time while maintaining compatibility. “We version the metering event schema; consumers support the last two versions so producers can roll forward without breaking anything.”

35. “Train/serve consistency.” The feature value at serving time must match the feature value the model was trained on. Feature stores exist to enforce this. “The feature store solves train/serve consistency — the same materialization code produces both the offline training set and the online serving features.”

122.5 Failure and recovery vocabulary

36. “RTO and RPO.” Recovery Time Objective (how long to restore service) and Recovery Point Objective (how much data loss is acceptable). “For the serving platform, RTO is 5 minutes and RPO is 0 — we run multi-region active-active so a failover is a routing change, not a restore.”

37. “Blameless postmortem.” An incident review that focuses on system failures, not individual mistakes. Standard SRE culture. “The runbook has a postmortem template — blameless, causal chain, action items with owners.”

38. “Five whys.” A root-cause analysis technique: ask “why” five times to get past symptoms. “The five-whys from last week’s outage: bad config → missing validation → tests skipped in CI → Bazel cache missed rebuild → the rule wasn’t hermetic. Every layer is a fix candidate.”

39. “Error budget.” The fraction of time a service is allowed to be out of SLO. Once consumed, ship less. “Our 99.9% availability SLO gives us 43 minutes of budget per month — we burned 20 in the last incident, so risk-seeking deploys are paused this week.”
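The budget arithmetic from the example, as a sketch (a 30-day month is assumed; the quoted "43 minutes" is this number rounded down):

```python
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed out-of-SLO time per month for an availability SLO."""
    return days * 24 * 60 * (1 - slo)

budget = monthly_error_budget_minutes(0.999)   # ~43.2 minutes at 99.9%
remaining = budget - 20                        # after the 20-minute incident
```

Being able to produce this number on a whiteboard in five seconds is most of what the phrase signals.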

40. “Runbook toil.” Manual, repetitive on-call actions that should be automated. SRE goal: reduce toil to 0. “The node-drain runbook used to be 20 clicks; we wrote a kubectl operator that does it in one command — that’s how you reduce toil.”

122.6 ML-specific systems vocabulary

41. “KV cache pressure.” The condition where the KV cache is close to full and the scheduler is forced to drop or evict requests. “Under KV cache pressure, vLLM preempts and re-runs prefill on dropped requests — we autoscale on gpu_cache_usage_perc > 85% to avoid this.”

42. “Prefix hit rate.” The fraction of requests whose prompt prefix is already in the cache. A dominant lever for cost. “On a system-prompted chatbot, prefix hit rate runs at ~80% — that’s where most of our cost savings come from.”
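A back-of-envelope for why hit rate dominates cost, using the entry's 2k-token system prompt and ~80% hit rate; the 200-token user turn is an assumed number for illustration:

```python
def expected_prefill_tokens(hit_rate, prefix_len, suffix_len):
    """Expected prefill work per request when a cached prefix is skipped:
    hits pay only the suffix, misses pay prefix + suffix."""
    return hit_rate * suffix_len + (1 - hit_rate) * (prefix_len + suffix_len)

# No cache: every request pays the full 2200 tokens of prefill.
full = expected_prefill_tokens(0.0, 2000, 200)
# 80% prefix hit rate: expected prefill drops to 600 tokens per request.
cached = expected_prefill_tokens(0.8, 2000, 200)
```

At these numbers the cache removes roughly 73% of prefill compute, which is why the entry calls hit rate "a dominant lever for cost."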

43. “Continuous batching.” Token-level scheduling of multiple requests in a single forward pass; the dominant serving pattern post-Orca. “Continuous batching is the base optimization — without it, throughput collapses because short requests waste the batch.”

44. “Prefill/decode disaggregation.” Running prefill and decode on separate GPU pools to optimize each for its workload shape. “For a vision-language workload where each image is 1000 prefill tokens, PD disaggregation adds 30–50% throughput — the prefill GPU stays busy on images while decode stays busy on tokens.”

45. “Model FLOP utilization (MFU) / Memory bandwidth utilization (MBU).” The fraction of theoretical peak that a workload achieves. “Our 70B decode runs at ~70% MBU on H100s with continuous batching — that’s realistic; anyone claiming 95% is benchmarking a closed-loop.”

46. “Chinchilla-compute-optimal.” Training a model with the data-to-param ratio that minimizes loss for a given compute budget (roughly 20 tokens per parameter). “Llama 3 is trained far past Chinchilla-optimal — the extra tokens cost more training compute but buy better quality per inference dollar, which is the tradeoff that matters at deployment.” Signal: the candidate has read the Chinchilla paper.

47. “LLM-as-judge.” Using a larger LLM to evaluate another model’s outputs. Common in eval pipelines. “The nightly eval uses GPT-4 as a judge over a golden set — we know it’s biased, but the relative deltas between model versions are trustworthy.”

122.7 Benchmarking and evaluation vocabulary

48. “Multi-axis benchmark.” The principle that a single metric (say throughput) is never enough; you need at least per-GPU throughput AND total throughput AND p50 AND p99. “The disaggregated PD benchmark is multi-axis — it’s a wash on per-GPU efficiency, a win on total throughput, and a 40% win on p99 latency, all at once.”

49. “Golden set.” A curated, stable eval set used to gate model releases. Grown over time from production incidents. “Every incident where a user reported a bad response adds an example to the golden set — the set compounds in value over time.”

50. “Contamination.” The phenomenon where benchmark data leaks into training data, inflating scores. A chronic problem in public benchmarks. “MMLU is contaminated in most frontier models — our internal golden set is the only eval we trust for release gates.”

122.8 Ops and deployment vocabulary

Five bonus phrases that didn’t fit in the fifty but are worth knowing:

51. “GitOps.” Configuration lives in Git; a reconciler continuously applies it to the cluster. ArgoCD and Flux are the standards. “Everything in the cluster is GitOps-managed — there are no manual kubectl applies in production.”

52. “Canary deployment.” Gradually shifting traffic to a new version to catch regressions before full rollout. “New model versions canary at 1% → 5% → 25% → 100% with automatic rollback on any SLO breach.”

53. “Distroless image.” A minimal container image with no shell or package manager, just the binary and its runtime. Standard for production. “We build distroless images for the serving containers — smaller attack surface, faster pulls, no shell for intruders to escape into.”

54. “Pod disruption budget.” A Kubernetes primitive that limits how many pods can be voluntarily killed at once during a rollout or drain. “The serving Deployment has a PDB of minAvailable: 80% — that protects against accidental mass eviction during node upgrades.”

55. “Shift-left security.” Integrating security into development workflows (scanning in CI, secrets in vault, code review) rather than treating it as a separate audit. “We shift security left with image scans in CI, secret detection on every PR, and a mandatory vuln review before deploy.”

122.9 The full list as a flashcard

Skim this list before an interview. If any phrase is unfamiliar, open the section and re-read the definition.

  1. At-least-once with idempotent consumers
  2. Exactly-once is a lie unless you pay
  3. Back-pressure not failure
  4. Shed load before queueing
  5. Fail closed vs fail open
  6. Blast radius
  7. Idempotency key
  8. Retry with exponential backoff and jitter
  9. Dead letter queue
  10. Graceful degradation
  11. Tail latency vs p50
  12. Open-loop vs closed-loop benchmark
  13. Saturate the bottleneck
  14. Amortize the fixed cost
  15. Cache locality
  16. Warm path vs cold path
  17. Asymptotic vs amortized
  18. Head-of-line blocking
  19. Little’s Law
  20. P99 of the dependency
  21. Provisioned/reserved/on-demand/spot
  22. Elastic horizontal scaling
  23. Scale-to-zero
  24. Cell-based architecture
  25. Bulkhead pattern
  26. Capacity headroom
  27. Noisy neighbor
  28. Read-your-writes
  29. Eventual consistency
  30. Strong consistency / linearizability
  31. Write-through/back/around
  32. Content-addressed storage
  33. Append-only log
  34. Schema evolution
  35. Train/serve consistency
  36. RTO / RPO
  37. Blameless postmortem
  38. Five whys
  39. Error budget
  40. Runbook toil
  41. KV cache pressure
  42. Prefix hit rate
  43. Continuous batching
  44. Prefill/decode disaggregation
  45. MFU / MBU
  46. Chinchilla-compute-optimal
  47. LLM-as-judge
  48. Multi-axis benchmark
  49. Golden set
  50. Contamination

122.10 The mental model

Eight points to take into Chapter 123:

  1. Vocabulary is a handshake. Using the right phrase signals you’ve built the thing; explaining around it signals you’ve read about the thing.
  2. The fifty phrases cover six domains: reliability, latency, capacity, data, failure, and ML-specific. Each domain has ~8 phrases.
  3. Use phrases in context, not as namedrops. “At-least-once with idempotent consumers” is fine; “we have at-least-once semantics” is half a sentence.
  4. Misusing a phrase is worse than avoiding it. If you’re not sure what “linearizability” means, don’t use it. Use “strong consistency” instead.
  5. Pair every technical phrase with a tradeoff. “We use at-least-once because exactly-once would cost us X” is stronger than just naming the pattern.
  6. Senior interviewers will try to test a phrase you used. If you say “PD disaggregation,” expect the next question to be “when does PD not help?” Be ready.
  7. Replace corporate jargon with precise vocabulary. “Enhance” → “amortize”. “Leverage” → “exploit the common structure of.” Precise is senior.
  8. The phrases compound across an interview. Using 10 of them correctly over 45 minutes shifts the interviewer’s model of your experience from junior to senior without you having to claim it.

Chapter 123 is the final chapter of the book proper: what to do when you realize, 15 minutes in, that you went down the wrong path. The reset move, the recovery move, and the graduation.


Read it yourself

  • Dean and Barroso, The Tail at Scale (2013). For the tail latency vocabulary.
  • Kleppmann, Designing Data-Intensive Applications — the consistency, replication, and system-design vocabulary.
  • Betsy Beyer et al., Site Reliability Engineering (Google SRE book). For SLO, error budget, blameless postmortem.
  • Kleppmann, “A Critique of the CAP Theorem” (2015) and “Please stop calling databases CP or AP” for the consistency vocabulary.
  • Hoare, Communicating Sequential Processes — for the back-pressure and queueing vocabulary’s intellectual roots.
  • The Netflix Tech Blog — every post uses this vocabulary in context; read five posts and note the phrasing.

Practice

  1. Pick five phrases you don’t use comfortably. For each, write one sentence using it in context for a specific design decision.
  2. Record yourself answering “how would you design a chatbot for 1M users?” in 60 seconds. Count the number of phrases from this chapter you naturally used. Target: 8+.
  3. Translate this corporate sentence into senior vocabulary: “We will leverage our robust infrastructure to enhance availability for mission-critical workloads.”
  4. For each of the 50 phrases, identify which of Chapters 1–119 is the deepest reference. Build a cross-reference table.
  5. A junior engineer uses “exactly-once semantics” in a design review. Write a one-paragraph response that corrects them without being condescending.
  6. An interviewer asks “what is an error budget?” Answer in 30 seconds with an example.
  7. Stretch: Build a personal “phrase diary” where you note each phrase from this chapter as you use it in a real meeting or interview over the next month. After 30 days, review which phrases you’ve used, which you’ve avoided, and which you’ve misused.