Part X · ML System Design Interview Playbook
Chapter 114

The interview framework: clarify, estimate, design, drill, ops

"The interviewer doesn't grade the answer. They grade the process of arriving at one"

Part X is the capstone. The previous 111 chapters built a field. This chapter is the first tool in the toolbox that turns that field into a 45-minute performance. An ML systems design interview is not an exam on the material — it is a simulation of what it looks like to sit across from a senior engineer and talk through a production problem. The candidates who pass have a process. The candidates who fail have memorized a reference architecture.

The process has five phases: clarify, estimate, design, drill, ops. Each phase is a specific mode the candidate operates in, with specific outputs, specific time budgets, and specific things the interviewer is listening for. This chapter is the framework. Every other chapter in Part X is either a practice drill on this framework (115–119) or a support chapter for executing it well (113, 120, 121).

Outline:

  1. What the interviewer is actually grading.
  2. The 45-minute anatomy.
  3. Phase 1 — clarify.
  4. Phase 2 — estimate.
  5. Phase 3 — high-level design.
  6. Phase 4 — drill into one or two components.
  7. Phase 5 — operations and failure modes.
  8. The meta-game — showing tradeoffs without being asked.
  9. How to push back on under-specified questions.
  10. Common failure modes in each phase.
  11. The mental model.

114.1 What the interviewer is actually grading

The scoring rubric for senior ML systems design interviews is rarely written down, but it is consistent across companies. Five axes:

Judgment under ambiguity. The question is deliberately vague. “Design a chatbot for 1M users” says nothing about latency budgets, quality targets, freshness needs, or budget. The interviewer wants to see whether the candidate slows down to ask, or whether they fabricate requirements and race to a diagram. Fabrication is the single most common failure mode.

Numerical fluency. Can the candidate do the back-of-envelope math — tokens per second per H100, $ per million tokens, RAM for a KV cache — in their head, without blinking? A mid-level candidate knows the formulas exist. A senior candidate knows the numbers. Chapter 115 is the kit.

Tradeoff awareness. For every design decision, can the candidate articulate “I could have done X, I chose Y, here is what I gave up”? A design without tradeoffs is a design the candidate memorized. A design with tradeoffs is a design the candidate reasoned to.

Operational realism. Does the candidate know what goes wrong at 2 a.m.? Rollouts, canaries, cold starts, the on-call experience, the blast radius of a bad deploy. A senior candidate tours ops as a peer of ops; a mid-level candidate tours ops as a tourist.

Recovery and self-correction. When the interviewer pushes back, does the candidate update their design cleanly, or do they dig in? This is the “reset move” (Chapter 123) — graceful mid-flight correction is a senior signal, not a weakness.

The candidates who win are not the ones who know the most. They are the ones who are the calmest under pressure, numerate, and willing to be wrong for the right reasons.

114.2 The 45-minute anatomy

A typical interview is 45 minutes, with a small amount of intro chat and a small amount of closing chat. That leaves roughly 40 minutes of design work. The time allocation that works:

  Phase                    Time     Output
  Clarify                  5 min    A list of requirements the interviewer has ratified
  Estimate                 5 min    DAU, QPS, storage, GPU count, $ envelope
  High-level design        10 min   Block diagram of the full system
  Drill (component 1)      10 min   Deep dive on one critical piece
  Drill (component 2)      5 min    A second dive, if time allows
  Ops and wrap             5 min    Monitoring, rollout, failure modes

The 45-minute interview broken into five time-boxed phases, each with a fixed output.
The drill is the densest phase and the one interviewers grade most heavily — announce the plan in the first 40 seconds so both sides share the clock.

Two things go wrong with time. First, candidates over-clarify — they spend 15 minutes asking questions and have 20 minutes left for design. Second, candidates under-estimate — they skip the capacity math entirely and jump to a diagram. Both are automatic loss conditions.

The right discipline is to announce the plan early. Forty seconds in: “I’m going to clarify requirements for about five minutes, then do a capacity estimate, then sketch a high-level architecture, then drill into whichever component you find most interesting — and save the last five minutes for ops and failure modes. Sound good?” This does three things. It signals process. It gives the interviewer a chance to redirect (“actually, I want to spend most of our time on the retrieval layer”). And it creates a shared clock.

If the interviewer says nothing, assume the clock is running and go.

114.3 Phase 1 — clarify

The clarify phase has one goal: extract enough constraints that the design problem becomes a real problem with a real answer. Under-specified questions are deliberate. The interviewer wants to see which constraints the candidate volunteers.

The standard clarifying questions for any ML systems problem:

Scale. How many users total? How many daily active? What is the peak-to-average ratio? How bursty is the traffic?

Latency. What is the p50 and p99 latency target? Is this an interactive experience (sub-second) or a batch experience (minutes)? For streaming responses (LLMs), what is the target time-to-first-token and time-per-output-token?

Quality. What does “good enough” look like? A classification accuracy target, a human-eval score, a faithfulness metric? Who decides when the model ships?

Budget. Is this a cost-minimization problem or a cost-is-no-object problem? Is the candidate designing for a hyperscaler or a startup? The answer changes everything.

Freshness. How stale can the data be? Seconds, minutes, hours, days? This determines whether the pipeline is real-time, micro-batch, or batch.

Availability. Is this a best-effort service or a 99.99% SLA service? Is a regional outage acceptable or does it need to be multi-region?

Data locality. Are there residency requirements? Is the model allowed to call external APIs? Is user data allowed to leave a region?

Tenancy. Single-tenant or multi-tenant? If multi-tenant, how noisy-neighbor-safe does it need to be?

Ask four to six of these, not all eight. Pick the ones that will most change the design. For a chatbot, ask scale, latency, quality, and budget. For a RAG system, also ask freshness and data locality. For a moderation pipeline, ask availability and the false-positive-vs-false-negative tradeoff. The question itself tells the candidate which clarifiers matter.

Every answer the interviewer gives should be written down — on the whiteboard, in a shared doc, on a notepad. A written list forces the interviewer to ratify the requirements, and it protects against the “I thought you said 100ms” trap mid-drill.
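One clarifier worth converting to arithmetic on the spot is the streaming-latency target: time-to-first-token plus time-per-output-token times the reply length gives the user's perceived wall-clock latency. A minimal sketch, with illustrative numbers:

```python
def perceived_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Wall-clock time until the last token of a streaming response arrives."""
    return ttft_s + tpot_s * output_tokens

# Illustrative targets: 300 ms to first token, 30 ms per output token,
# a 300-token reply -> 0.3 + 9.0 = 9.3 seconds end to end.
total = perceived_latency_s(0.300, 0.030, 300)
```

A "sub-second" feel therefore constrains TTFT, while the total read time is dominated by TPOT; the two targets are negotiated separately in the clarify phase.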

114.4 Phase 2 — estimate

Estimation is where senior candidates separate from mid-level ones. The goal: convert the requirements into concrete numbers for QPS, storage, GPU count, and dollar cost.

The standard estimation chain for an LLM-serving problem:

  1. Users. Start with DAU. If the problem says “1M users,” ask if that means 1M total, 1M MAU, or 1M DAU. Usually it means DAU; if it means MAU, divide by ~3 to get DAU.
  2. Sessions per user per day. Pick a number. For a chat app, 3–10 is reasonable. For a moderation pipeline, it’s upstream QPS, not sessions.
  3. Requests per session. For chat, a session is ~5–20 turns. For a one-shot query, a session is 1 request.
  4. Tokens per request. For chat, ~500 prompt + ~300 output is a reasonable average. For RAG, prompt grows to 2000–4000 from the retrieved context.
  5. Total tokens per day. Multiply.
  6. Tokens per second on average. Divide by 86,400.
  7. Peak QPS. Multiply average by 3–5× for daytime peak.
  8. Per-GPU throughput. From Chapter 30: ~1000 tokens/sec realistic for 70B bf16 on one H100.
  9. GPU count at peak. Divide peak tokens/sec by per-GPU throughput. Add ~30% headroom.
  10. Cost. Multiply GPU count by ~$2/hour × 24 × 30 to get monthly cost.

This whole chain is about 90 seconds of arithmetic once it’s memorized. Chapter 115 drills the numbers; Chapter 116 walks through a full example.
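The chain fits in a dozen lines. A minimal sketch, using the chapter's own rules of thumb (4× peak ratio, ~1000 tokens/sec per H100, 30% headroom, ~$2/GPU-hour) and illustrative workload numbers:

```python
import math

def estimate_fleet(dau: int, sessions_per_user: float, turns_per_session: float,
                   tokens_per_turn: int, peak_ratio: float = 4.0,
                   tokens_per_sec_per_gpu: int = 1000, headroom: float = 0.30,
                   gpu_dollars_per_hour: float = 2.0) -> dict:
    """Steps 1-10 of the estimation chain: users -> tokens -> GPUs -> dollars."""
    tokens_per_day = dau * sessions_per_user * turns_per_session * tokens_per_turn
    avg_tokens_per_sec = tokens_per_day / 86_400
    peak_tokens_per_sec = avg_tokens_per_sec * peak_ratio
    gpus = math.ceil(peak_tokens_per_sec / tokens_per_sec_per_gpu * (1 + headroom))
    monthly_cost_usd = gpus * gpu_dollars_per_hour * 24 * 30
    return {
        "tokens_per_day": tokens_per_day,
        "avg_tokens_per_sec": avg_tokens_per_sec,
        "peak_tokens_per_sec": peak_tokens_per_sec,
        "gpus_at_peak": gpus,
        "monthly_cost_usd": monthly_cost_usd,
    }

# Illustrative chat workload: 1M DAU, 5 sessions/day, 2 turns each, ~800 tokens/turn.
# Roughly 370k tokens/sec at peak -> a few hundred H100s.
est = estimate_fleet(dau=1_000_000, sessions_per_user=5,
                     turns_per_session=2, tokens_per_turn=800)
```

The point of writing it down once is to burn the order of operations into memory; in the interview the same chain is done out loud, one assumption at a time.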

The senior move in estimation is to call out the assumptions as you make them rather than presenting the final number as if it fell from the sky. “I’m going to assume 5 turns per session, which puts us at 5M chat turns per day — if the real number is 20, everything scales 4×.” This shows the interviewer that the candidate knows where the uncertainty lives, and it invites correction: “actually, assume 15 turns — does that change your architecture?”

The estimation phase ends with a one-line summary: “So we’re looking at roughly 300k tokens per second at peak, about 300 H100s, roughly half a million dollars per month for inference alone, not counting the retrieval layer.” That sentence is the pivot into design.

114.5 Phase 3 — high-level design

Ten minutes, one block diagram, cover everything. The goal: show the reference architecture the candidate would implement, at a zoom level where every box represents a real service.

The standard LLM platform reference architecture (which every design question in Part X maps onto):

  [ client ]
      |
      v
  [ edge / CDN ]
      |
      v
  [ API gateway / auth / rate limit ]
      |
      v
  [ admission / routing ]
      |
      v
  [ safety pre-filter ]  <--- rules + small classifier
      |
      v
  [ LLM serving fleet ]  <---> [ prefix cache ]
      |                 <---> [ KV cache fabric ]
      v
  [ safety post-filter ] <--- output moderation
      |
      v
  [ metering / billing event ]
      |
      v
  [ telemetry pipeline ] --> [ metrics / logs / traces ]

Every ML design problem in Part X maps onto this eight-box skeleton — knowing it cold lets you draw a complete diagram in under three minutes.

For retrieval-heavy problems, add a retrieval box (vector store + reranker) that the serving fleet queries before generation. For multi-tenant problems, add a tenancy isolation layer and a model registry. For fine-tuning problems, add a job scheduler and a training fleet that feeds artifacts to the model registry.

The senior move in the design phase is to label each box with the technology — not because the interviewer cares about brand names, but because it signals the candidate has shipped the thing. “The gateway would be an Envoy AI Gateway for OpenAI compatibility; the serving fleet is vLLM behind KServe; autoscaling is KEDA on num_requests_running; metering is a sidecar that publishes to Kafka.” Each technology choice is a claim the interviewer can challenge, which creates a natural next question.

Do not dwell here. Ten minutes for the whole diagram is tight. The point is to get the full picture on the board so the drill phase has a target.

114.6 Phase 4 — drill into one or two components

The drill is where the candidate earns or loses the job. The interviewer will pick a component — or ask the candidate to pick one — and expect a 10-minute deep dive that is as detailed as a mid-level candidate could do for the whole problem.

The candidate should pick the component that is most interesting for the problem. For a chatbot, drill into LLM serving (KV cache, batching, autoscaling). For RAG, drill into retrieval (hybrid search, reranker, freshness). For moderation, drill into the classification cascade. For multi-tenant serving, drill into GPU sharing or the billing pipeline.

The drill has a shape:

A well-shaped drill looks like a loop from job definition through failure modes — candidates who skip step 3 (dominant constraint) never surface the real engineering problem.
  1. State the component’s job in one sentence. “The LLM serving fleet’s job is to accept an OpenAI-compatible request, produce streaming tokens within the latency budget, and return metering events.”
  2. Walk the request lifecycle through the component. What happens on the first byte, what happens on steady state, what happens on the last byte.
  3. Identify the dominant constraint. KV cache memory? Batch composition? Tail latency? Cold start?
  4. Name the optimizations that apply. Continuous batching (Chapter 23), PagedAttention (Chapter 24), prefix caching (Chapter 29), quantization (Chapter 26), speculative decoding (Chapter 27).
  5. Describe the failure modes. What happens under overload? Under a cold start? Under a bad deployment?
  6. Describe the metrics. What does the candidate watch, and what alerts fire?

A drill done well looks like a conversation about a system the candidate has on-called for. A drill done poorly looks like a recitation of a paper the candidate skimmed.

The interviewer will interrupt with “what about X?” The correct response is to treat every interruption as a cue to deepen, not as a criticism. “Good catch — the KV cache can blow up under long contexts. Here’s how I’d handle it: drop-and-recompute on low-priority requests, offload to CPU DRAM on medium-priority, and reserve HBM for premium traffic.”
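The tiered KV-cache response described above can be sketched as a simple policy function. This is an illustration of the decision logic, not a real serving-engine API; the priority tiers and the 90% pressure threshold are assumptions:

```python
from enum import Enum

class Priority(Enum):
    LOW = 0
    MEDIUM = 1
    PREMIUM = 2

def kv_pressure_action(priority: Priority, hbm_utilization: float) -> str:
    """Decide what happens to a request's KV cache under memory pressure.

    Illustrative policy: below 90% HBM utilization everything stays resident.
    Above it, low-priority requests are preempted (drop-and-recompute later),
    medium-priority KV blocks are offloaded to CPU DRAM, and premium traffic
    keeps its reserved HBM.
    """
    if hbm_utilization < 0.90:
        return "keep_in_hbm"
    if priority is Priority.LOW:
        return "drop_and_recompute"
    if priority is Priority.MEDIUM:
        return "offload_to_cpu_dram"
    return "keep_in_hbm"
```

Naming a policy this concrete in the drill signals that the candidate has thought about who pays the latency cost when memory runs out, which is exactly the follow-up the interviewer is fishing for.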

114.7 Phase 5 — operations and failure modes

Five minutes at the end. The question: “How does this thing run in production?” The candidate who skips this phase loses the interview in the debrief, even if the design was brilliant.

The topics to hit:

Deployment and rollout. Blue/green, canary, shadow traffic. How new model versions go out (Chapter 98). For LLMs, an ML canary is not a traffic canary — it’s a golden-set eval that runs offline before any traffic is shifted.

Monitoring. The four golden signals for each component. For LLM serving: TTFT p99, TPOT p99, queue depth, KV cache utilization, error rate by class. For retrieval: recall@k, latency p99, freshness lag.

Autoscaling. What triggers a scale-up? What is the cold-start latency? What is the minimum replica count? (You almost always want min > 0 for LLM serving, because cold starts are tens of seconds.)

Failure modes. What happens when the upstream auth service is down? When a region goes dark? When the model returns garbage? The senior move is to name three concrete failure modes and say what the system does for each.

Cost controls. Where is the budget going? What levers can the candidate pull if the CFO asks for 20% cost reduction? Quantization, smaller models, aggressive caching, lower concurrency.

The on-call experience. What does the runbook look like? Which alerts wake someone up? Which are just advisory?
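The autoscaling point above reduces to one computation: desired replicas from an in-flight-request signal, clamped by a non-zero floor. A minimal sketch of what an autoscaler effectively evaluates; the target-per-replica and bounds here are illustrative assumptions:

```python
import math

def desired_replicas(num_requests_running: int,
                     target_per_replica: int = 8,
                     min_replicas: int = 2,
                     max_replicas: int = 64) -> int:
    """Scale on in-flight requests. Keep min_replicas > 0 because LLM cold
    starts (weight loading, graph capture, warmup) take tens of seconds."""
    want = math.ceil(num_requests_running / target_per_replica)
    return max(min_replicas, min(want, max_replicas))
```

Saying the floor-and-ceiling clamp out loud, with the cold-start justification for the floor, is a one-sentence ops signal that costs nothing.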

This phase is short but high-signal. It is also where the candidate should volunteer a caveat or two the interviewer hasn’t asked about yet — evals, data privacy, compliance. Showing that the candidate thought about the things outside the interview’s frame is the strongest possible close.

114.8 The meta-game — showing tradeoffs without being asked

Senior candidates volunteer tradeoffs. Mid-level candidates hide them. The difference is visible in every sentence.

A mid-level candidate says: “I’ll use FAISS with HNSW.”

A senior candidate says: “For the vector index, I’d pick HNSW over IVF because we need low p99 latency and the index fits in RAM — but if the index grew past 200 GB, I’d switch to IVF with product quantization to trade recall for memory. For 10 TB of raw documents, we’ll probably need disk-based ANN like DiskANN, and I’d benchmark both before committing.”

The second answer is longer, but it shows the candidate knows the frontier. Every design choice in the interview is a point on a Pareto curve, and the senior candidate names the axes and the point.
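The "fits in RAM" claim is checkable on the spot with rule-of-thumb memory formulas: full-precision vectors plus graph links for HNSW versus compressed PQ codes for IVF-PQ. The constants below are rough conventions for back-of-envelope use, not exact figures for any particular library:

```python
def hnsw_ram_gb(n_vectors: int, dim: int, m_links: int = 32) -> float:
    """float32 vectors plus roughly 2*M graph links of 4 bytes per vector."""
    bytes_per_vec = dim * 4 + 2 * m_links * 4
    return n_vectors * bytes_per_vec / 1e9

def ivfpq_ram_gb(n_vectors: int, pq_bytes: int = 64) -> float:
    """PQ codes only; coarse centroids are negligible at this scale."""
    return n_vectors * pq_bytes / 1e9

# 1B vectors at dim 768: HNSW needs ~3.3 TB of RAM,
# while IVF-PQ at 64 bytes/code needs ~64 GB.
```

Two lines of arithmetic like this, done aloud, is what turns "I'd benchmark both" from a dodge into a plan.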

The ritual: for each big decision, say “I considered X and Y; I chose Y because reason; here’s when I would switch to X.” Three sentences, roughly. Do this for the gateway (Envoy vs custom), the serving runtime (vLLM vs SGLang vs TensorRT-LLM), the autoscaler (KEDA vs HPA vs Knative), the feature store (Feast vs Tecton vs custom), the vector index (HNSW vs IVF vs DiskANN), and one or two others. Six to eight explicit tradeoffs across the hour is about right. More is overkill; fewer looks like rehearsal.

The meta-game also includes volunteering the thing you don’t know. A senior candidate says: “I’ve never shipped a 10 TB DiskANN index in production — I know the paper, but my real experience tops out at about 500 GB. So I’d want to build a load test before committing.” This is a net positive signal: it demonstrates calibration, and it removes any trap-door question where the interviewer can corner the candidate on operational details they couldn’t have.

114.9 How to push back on under-specified questions

Interviewers leave questions open on purpose. The worst response is to silently invent requirements; the best response is to ask sharp clarifying questions and then ratify the answers with a one-sentence summary.

But sometimes the interviewer refuses to commit. “You decide — what would make this more interesting?” This is a test. The correct move is to commit to a specific scenario, out loud, and anchor the rest of the design to it:

“Okay — I’m going to assume this is a consumer chat app with 1M DAU, not an enterprise internal tool. That means I’m optimizing for cost per token over strict SLAs, the quality target is ‘as good as GPT-4o-mini at half the cost,’ and I don’t need regional failover on day one. If any of that is wrong, tell me now and I’ll rewind.”

This does three things: it commits to a scenario (which lets the design proceed), it reveals the candidate’s intuition about which scenario is interesting (which is itself a judgment signal), and it gives the interviewer a cheap way to redirect. It is the right shape for every open-ended ML systems question.

The wrong move is to design for the union of all possibilities. A design for “every possible workload” is a design for no workload. Senior engineers commit.

114.10 Common failure modes in each phase

The ways candidates blow up, ranked by frequency:

Clarify phase. (a) Asking zero clarifying questions. (b) Asking 20 clarifying questions and running out the clock. (c) Asking fuzzy questions like “tell me more about the problem” instead of specific ones like “what’s the p99 latency target?”. (d) Not writing down the answers.

Estimate phase. (a) Skipping it entirely and jumping to a diagram. (b) Getting the unit conversions wrong (tokens/sec vs tokens/hour vs tokens/day). (c) Not knowing the per-GPU throughput for common models. (d) Not naming the assumptions out loud.

Design phase. (a) Drawing a diagram from a blog post verbatim, with no customization to the question. (b) Missing a critical box (no safety layer, no metering, no telemetry). (c) Over-labeling — every box with five sub-boxes, no narrative flow. (d) Not naming the technologies.

Drill phase. (a) Picking the wrong component (drilling into the gateway when the interviewer cares about retrieval). (b) Surface-level depth — naming concepts without explaining them. (c) Ignoring interviewer interruptions instead of treating them as hints. (d) Running out of time because the drill sprawled.

Ops phase. (a) Skipping it. (b) Saying “we’d monitor it in Prometheus” and nothing else. (c) Not knowing what an SLO or an error budget is. (d) Not mentioning rollout or canaries.

Every one of these failures is preventable with practice. Chapters 115–119 are the practice reps.

114.11 The mental model

Eight points to take into Chapter 115:

  1. The interviewer grades process, not answers. Show judgment, numerical fluency, tradeoffs, ops awareness, and recovery.
  2. The 45 minutes has five phases with a fixed time budget. Announce the plan early and keep the clock visible.
  3. Clarify before designing. Ask four to six sharp questions; write down the answers; get them ratified.
  4. Estimate in public. Walk the chain from users → tokens → GPUs → dollars out loud, naming assumptions.
  5. High-level design is one diagram with labeled technologies. Ten minutes. Move on.
  6. Drill is the deepest phase. Pick one or two components and go as deep as the interviewer allows, treating interruptions as hints.
  7. Ops phase is non-negotiable. Rollouts, monitoring, cost levers, failure modes, on-call.
  8. Volunteer tradeoffs constantly. Every big choice should be three sentences: considered X and Y, chose Y because reason, here’s when to switch.

Chapter 115 is the back-of-envelope numerical kit — the raw material for the estimate phase and every drill that follows.


Read it yourself

  • Alex Xu, System Design Interview — An Insider’s Guide (Volumes 1 and 2). Not ML-specific, but the framing of the clarify-estimate-design loop is the canonical treatment.
  • The “Grokking the System Design Interview” problem list for the broad frame, stripped of ML context.
  • The Google SRE book, chapters on SLIs/SLOs and postmortems — this is the operational vocabulary senior interviewers test on.
  • Martin Kleppmann, Designing Data-Intensive Applications, chapters 1–3. The systems-thinking foundation every ML interview assumes.
  • Jeff Dean’s “Numbers every programmer should know” slide (2012) — the ancestor of Chapter 115’s numerical kit.
  • The “Latency Numbers Every Programmer Should Know” gist on GitHub, updated annually.

Practice

  1. Write down the five phases of the framework from memory, with the time budget for each and one sentence describing what the candidate produces in each phase.
  2. Pick any question from Chapters 116–119 without reading it. Spend exactly five minutes on the clarify phase and write down your six questions. Compare to what the chapter actually asks.
  3. For a chatbot serving 10M MAU with 3 sessions per day and 800 tokens per session, do the full estimation chain out loud, end to end, in under two minutes. Write down each intermediate number.
  4. Draw the reference architecture from §114.5 on a whiteboard from memory. Label every box with a technology choice.
  5. Pick one component in that architecture and drill it for 10 minutes, out loud, with no notes. Hit: job, request lifecycle, dominant constraint, optimizations, failure modes, metrics.
  6. List three common failure modes from §114.10 that you are most likely to fall into under pressure. Write a one-line mitigation for each.
  7. Stretch: Find a senior engineer and do a mock interview on the chatbot-for-1M-users problem (Chapter 116) using only this framework. Record the audio. Listen back. Count the number of times you volunteered a tradeoff without being asked; aim for six to eight.