Company-specific prep and mock interview transcripts
"Every interviewer is asking the same question through a different lens. Your job is to figure out which lens they're holding before you open your mouth."
There is one ML systems design question. There are fifty versions of it. “Design a chatbot” and “design an inference API” and “design a RAG system” are surface variations. The real variation — the one that separates candidates who get offers from candidates who don’t — is about which company is asking and what that company cares about most.
A generic textbook answer to “design an LLM serving platform” will get you through a mid-level screen at a company that doesn’t care much about ML systems depth. It will fail you at OpenAI, where the interviewer has written the infrastructure you’re describing and wants to probe your understanding of chunked prefill scheduling at scale. It will also fail you at Amazon, where the interviewer is thinking about the cost-per-inference number and wants to hear about spot instances and quantization tradeoffs before you spend thirty minutes on the architecture diagram.
The best candidates don’t just know the reference architecture. They know who is interviewing them, what that company ships in production, what keeps that company’s on-call engineers awake at 3 a.m., and what getting promoted at that company actually requires. They walk into the room and adapt the first three minutes of their answer to send those signals back.
This chapter is about doing that adaptation systematically.
The first section covers fourteen companies — their emphasis areas, role structures, leveling, and the questions they’re known to ask. Read the section for the companies you’re actively targeting; skim the rest for pattern recognition. A candidate who interviews at OpenAI after reading the Meta section will still know more than one who read nothing.
The second section covers the role taxonomy — ML engineer versus ML systems engineer versus applied scientist versus research engineer — because the interview loop differs meaningfully by role even within the same company.
The remaining four sections are three mock transcripts and a self-grading guide. The transcripts are full-length simulations: an LLM API service, a RAG system, and a multi-tenant fine-tuning service. They are meant to be read as transcripts, not essays — with realistic back-and-forth, imperfect answers, mid-flight corrections, and an explicit debrief at the end of each. The final section shows you how to self-grade your own transcript.
One important framing before starting: the framework from Chapter 114 applies at every company. What changes is the vocabulary you emphasize and the tradeoffs you surface first. At Meta, you surface sharding and multi-tenancy early. At Netflix, you surface A/B infrastructure. At Stripe, you surface idempotency. The skeleton is the same; the first five words are different.
130.1 Company-specific signals: what different companies emphasize
OpenAI
Roles. Infrastructure Engineer, ML Systems Engineer, Research Engineer. The “Applied AI Engineer” track is essentially a product-engineering role and interviews differently from the core serving infrastructure track. Know which loop you’re in.
What they emphasize. Inference internals above everything else. The interviewers have shipped vLLM, RLHF pipelines, and the ChatGPT serving layer. If you say “continuous batching” without being able to explain the scheduler’s decision loop, they will notice. Expect deep questions on KV cache management, speculative decoding, tensor parallelism tradeoffs, and the gap between benchmark throughput and production throughput. Multi-tenant isolation matters because the API is a public product, but the depth questions are almost always inference-side.
Safety and evaluation come up seriously. OpenAI has an entire policy-and-safety org, and system design interviews for senior roles will ask about the evaluation pipeline — how you gate a model release, what an LLM-as-judge setup looks like at scale, how you handle adversarial prompts in a real serving path.
Leveling. IC4 (mid) through IC6 (staff). No traditional L3/L4/L5 nomenclature — they use numeric levels. Comp is equity-heavy. IC5 can clear $500k–700k TC at current valuation.
Signature question. “Design a batching scheduler that minimizes latency at 95th percentile while maximizing GPU utilization.” They want you to walk through preemption, chunked prefill, KV cache pressure as a scheduling signal, and the tradeoff between greedy batch composition and fairness.
IC4 vs IC5 delta. IC4 gives a correct reference architecture. IC5 surfaces the preemption-vs-throughput tradeoff unprompted, names the failure mode where long requests stall the scheduler, and proposes a concrete metric (preemption count per minute) as the autoscale signal.
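The scheduler tradeoff in that delta can be made concrete in a few dozen lines. This is a toy model, not vLLM's actual scheduler: admission is greedy on KV headroom (favoring throughput), preemption evicts whichever request holds the most cache, and preempted work is naively recomputed from scratch. The `Request` and `Scheduler` names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: str
    prompt_tokens: int
    generated: int = 0        # decode steps completed so far

@dataclass
class Scheduler:
    """Toy continuous-batching decision loop (illustration only)."""
    kv_budget_tokens: int
    running: list = field(default_factory=list)
    waiting: list = field(default_factory=list)
    preemptions: int = 0      # the autoscale signal discussed above

    def kv_usage(self) -> int:
        return sum(r.prompt_tokens + r.generated for r in self.running)

    def step(self):
        # Admit waiting requests greedily while there is KV headroom.
        while self.waiting:
            nxt = self.waiting[0]
            if self.kv_usage() + nxt.prompt_tokens > self.kv_budget_tokens:
                break
            self.running.append(self.waiting.pop(0))
        # Decode one token for every running request.
        for r in self.running:
            r.generated += 1
        # Under KV pressure, preempt the request holding the most cache.
        while self.kv_usage() > self.kv_budget_tokens and self.running:
            victim = max(self.running,
                         key=lambda r: r.prompt_tokens + r.generated)
            self.running.remove(victim)
            victim.generated = 0          # naive policy: recompute on resume
            self.waiting.append(victim)
            self.preemptions += 1
```

A long-running request inflates its own KV footprint one token at a time, which is exactly why the preemption counter climbs under sustained load and makes a usable scaling signal.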
Anthropic
Roles. Infrastructure Engineer, Research Engineer, Model Release Engineer. The infrastructure track is focused on the Claude API serving layer; research eng is a mix of training infrastructure and interpretability tooling; model release is unusual and worth understanding if you’re interviewing for it.
What they emphasize. Safety, evaluation, and inference, roughly in that order. The safety emphasis is organizational — Anthropic genuinely treats it as a core engineering constraint, not a compliance checkbox. Design questions will probe how you think about adversarial users, prompt injection, jailbreaks in the serving path, and false-positive rates on moderation. Evaluation infrastructure — how you build golden sets, run LLM-as-judge at scale, compare model generations — is stressed harder here than almost anywhere else.
Inference depth matters on the infra track. Anthropic has pushed hard on long-context (100k–200k tokens) serving, which creates specific KV cache management problems that most other companies don’t face yet. If you interview on the infra team, go deep on long-context KV cache eviction strategies.
Leveling. ICx (similar to OpenAI). Equity-heavy startup comp, tightly correlated to funding round valuation. Front-load your questions about vesting cliff and liquidity.
Signature question. “How would you design the eval pipeline that gates a new Claude model release?” They want to hear about golden set construction, LLM-judge bias, contamination, the difference between automated evals and human red-teaming, and what happens when a model passes automated evals but fails red-team.
IC4 vs IC5 delta. IC4 designs the pipeline correctly. IC5 identifies that the golden set is contaminated over time as model outputs leak into fine-tuning data, proposes a rotation strategy, and notes that the judge model has its own version-dependency problem.
Google DeepMind / Google Cloud ML
Roles. Software Engineer (ML), Research Engineer, Production Engineer. The SWE-ML loop at Google is the deepest on distributed systems. The PE (production engineer) loop focuses on reliability and infra. Google has a single global ladder; the ML SWE is the most commonly interviewed path for systems candidates.
What they emphasize. Distributed systems depth. Spanner, Colossus, Borg — these are the intellectual heritage. Even if you’re designing an LLM serving system, an interviewer who shipped TF Serving or Pathways will expect you to talk about distributed scheduling, tensor parallelism across pods, failure domains, and how you handle partial failures in a 1000-GPU training run. Scale and multi-tenancy matter because GCP is a hyperscaler, and isolation requirements are different from a single-product company.
Leveling. L5 (senior) through L8 (distinguished). External hires typically target L5 or L6. L6 is where the difficulty asymptote starts — Google’s L6 bar is one of the highest in the industry for systems roles.
Signature question. “You are redesigning the prefill/decode disaggregation layer for a 1T-parameter model. How do you partition work across two GPU pod types?” They want to hear about routing protocols, KV cache transfer between prefill and decode workers, and the bandwidth requirements of moving multi-hundred-GB KV caches between pods. This is the real PD disaggregation question, not the toy version.
L5 vs L6 delta. L5 correctly describes the two-pool architecture and the routing problem. L6 quantifies the KV transfer bandwidth, identifies the tail latency problem when the decode pool is slower than the prefill pool, and proposes a work-stealing mechanism with bounded queue depth.
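The L6-level bandwidth quantification is simple arithmetic once you fix a model shape. The configuration below is hypothetical (120 layers, 8 KV heads of dimension 128, FP16 cache, a 400 GB/s inter-pod link), chosen only to show the method, not to describe any real 1T-parameter deployment:

```python
def kv_bytes_per_token(layers: int, kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # K and V each store layers * kv_heads * head_dim elements per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Hypothetical config for illustration only.
per_token = kv_bytes_per_token(layers=120, kv_heads=8,
                               head_dim=128, dtype_bytes=2)  # ~0.49 MB/token
ctx = 32_000                      # prompt tokens handed from prefill to decode
cache_gb = per_token * ctx / 1e9  # ~15.7 GB per request
xfer_ms = cache_gb / 400 * 1000   # over an assumed 400 GB/s link: tens of ms
```

Even with these friendly assumptions, a single long-context handoff moves double-digit gigabytes, which is why the transfer path, not the routing logic, dominates the design.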
Meta (GenAI)
Roles. Production Engineer, ML Engineer, Research Engineer (LLM Infra). Meta’s PE role is their most valued infra track — it’s not a lesser SWE, it’s a specialized systems discipline. ML Eng covers training and serving for GenAI products (Meta AI, Llama serving).
What they emphasize. Scale and multi-tenant isolation, more than anywhere else. Meta serves billions of users. The question is not “how do you build this” but “how do you build this when you can’t afford the bug blast radius to touch more than 0.01% of users.” Sharding strategies, namespace isolation, cross-datacenter replication, and the failure-domain thinking that comes with Facebook-scale infrastructure.
Cost. Meta is religious about efficiency. They built custom silicon (MTIA) and custom networking (RoCE) because at their scale, a 5% efficiency gain is worth a team. Expect cost questions to come up early. “You said you’d use H100s — when do you switch to MTIA, and what changes?”
Signature question. “Design the multi-tenant model-serving platform for Meta AI that serves 100M+ DAU across eight Llama variants.” The interesting part is the routing logic: which model gets this request, how do you enforce per-tenant quotas without a global bottleneck, and how does a single bad tenant not degrade the whole fleet?
E5 vs E6 delta. E5 (equivalent to senior SWE) designs the correct multi-tenant isolation. E6 proposes per-tenant capacity headroom budgets with a mechanism to reclaim unused capacity in real time, identifies the thundering herd problem when a new model version hot-deploys to all tenants simultaneously, and recommends staggered rollout at the tenant-shard level.
Amazon / AWS (Bedrock, SageMaker)
Roles. Software Development Engineer (SDE2/SDE3), Applied Scientist, Senior Solutions Architect (for enterprise-facing roles). Amazon interviews for Bedrock engineering and SageMaker serving are system-design-heavy.
What they emphasize. Cost and operational simplicity, more than inference sophistication. Amazon is a retailer with a cloud business. The engineering culture values frugality. If your design involves an expensive component, you will be asked “what’s the cost per call?” before the interviewer asks anything about latency. The Leadership Principles — especially “Customer Obsession” and “Frugality” — are not just HR material; interviewers literally evaluate whether you can connect a design decision back to a principle.
SageMaker interviewers want to hear about the operational surface area. How do you version models? How do you roll them back? What does the customer-facing API for model deployment look like, and how is it different from the internal API? Enterprise constraints — VPC isolation, private endpoints, compliance certifications — come up in SageMaker design interviews more than anywhere else on this list.
Leveling. SDE2 (mid), SDE3 (senior), Principal (staff+). Amazon’s SDE3 bar is real but takes longer to clear than at OpenAI or Google because the LPs add a second dimension to the interview loop.
Signature question. “A customer using SageMaker Endpoints says their latency doubled after enabling multi-model endpoints. How do you debug this and what do you change?” The answer involves model-load cold starts, the warm-pool behavior of MME, and the difference between load balancing at the gateway and load balancing at the instance level.
SDE2 vs SDE3 delta. SDE2 identifies the cold-start issue and proposes pre-warming. SDE3 quantifies the expected load time, identifies that the load-balancing algorithm is unaware of in-model-load state and proposes a weighted routing signal based on current resident models per instance, and checks whether auto-scaling is configured to use request latency as a custom metric rather than CPU.
Microsoft (Azure OpenAI, Copilot)
Roles. Software Engineering (SWE), Principal Software Engineer (PSE), Partner Software Engineer. The Azure ML / Azure OpenAI track is the most systems-heavy; the Copilot track mixes systems with product-side ML.
What they emphasize. Enterprise constraints and distributed systems. Microsoft’s customers are Fortune 500 companies with strict compliance requirements: data residency, audit logs, customer-managed keys, VNet injection. “How does your design handle a customer who cannot allow their data to cross a geographic boundary” is not a hypothetical question at Microsoft — it’s a P0 requirement that shapes architecture.
Cost optimization at enterprise scale. Azure AI services are priced on consumption, and the margin on Copilot seats is a real business metric. Expect questions about batching, caching, model routing to smaller models for simple queries, and the value of pre-generating common responses.
Leveling. SWE (L59), Senior SWE (L63), Principal (L65). L63 is roughly Google L5/Meta E5. Principal is the inflection point; getting there from outside requires a very strong loop.
Signature question. “Design the multi-tenant rate limiting layer for Azure OpenAI that enforces both per-customer-per-deployment and per-customer-cross-deployment quotas, with millisecond enforcement and audit-trail compliance.” This is a distributed rate-limiting problem with consistency requirements, and the interesting part is whether the candidate knows that global consistency in a rate limiter is expensive and proposes a local-counter-with-periodic-sync architecture instead.
L59 vs L63 delta. L59 proposes a Redis-based centralized rate limiter. L63 identifies the single-point-of-failure and the tail latency cost, proposes per-region local counters with an async reconciliation step, and explicitly addresses the compliance requirement that over-quota requests are logged with enough detail to audit.
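The local-counter architecture the L63 answer proposes can be sketched as follows. This is a single-process illustration with invented names; a real deployment would feed `reconcile` from an async replication or pub/sub channel, and over-quota rejections would be written to an audit log rather than silently returned:

```python
class RegionalLimiter:
    """Sketch of local-counter rate limiting with periodic reconciliation.

    Enforcement checks (local count + last-known remote count), so the
    hot path is a local memory read; a background sync refreshes
    `remote_used` every few hundred milliseconds.
    """
    def __init__(self, global_quota: int):
        self.global_quota = global_quota
        self.local_used = 0
        self.remote_used = 0   # updated by async reconciliation

    def try_admit(self, tokens: int = 1) -> bool:
        if self.local_used + self.remote_used + tokens > self.global_quota:
            return False       # over quota: audit-log and reject
        self.local_used += tokens
        return True

    def reconcile(self, remote_used: int) -> None:
        # Called periodically with the sum of other regions' counters.
        self.remote_used = remote_used
```

The tradeoff is explicit: between reconciliations a customer can briefly exceed the global quota by the sum of in-flight local admissions, which is usually acceptable; a centralized Redis counter removes that slack but puts a network hop and a failure domain on every request.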
Netflix
Roles. Software Engineer (ML Platform), Senior Software Engineer, Staff Software Engineer. Netflix’s ML platform team is smaller than you’d expect; the bulk of ML engineering is in the recommendation and experimentation teams.
What they emphasize. Streaming infrastructure and A/B testing. Netflix’s core loop is: produce content → recommend it → measure whether the recommendation was good → iterate. The recommendation-and-ranking systems are the central ML asset. A/B infrastructure — how you split traffic, measure long-horizon effects, and avoid cannibalization between experiments — is the dominant systems concern.
For GenAI work (which is growing at Netflix), the emphasis shifts to content understanding, audio/visual model serving, and offline processing pipelines. Latency requirements are looser than for a chat product because recommendations aren’t in the interactive path.
Leveling. Netflix has an unusual structure: Senior Engineer is the target role for most external hires, and there is no SWE → Senior distinction. Total comp skews heavily toward cash relative to peers.
Signature question. “Design the experimentation platform that can run 200 simultaneous A/B tests on Netflix’s recommendation system without contaminating the treatment/control assignment.” The interesting part is what happens when users are in overlapping experiments, and how you handle the long-horizon measurement problem (a recommendation’s effect on retention takes 90 days to measure).
SE vs SSE delta. SE designs the assignment and measurement infrastructure correctly. SSE identifies the variance-reduction opportunity from CUPED or similar techniques, notes that the 90-day measurement horizon interacts with subscriber churn in a way that biases the metric, and proposes a holdout cohort design that distinguishes treatment effect from cohort effect.
Stripe
Roles. Software Engineer, Senior Software Engineer, Staff Engineer. ML roles at Stripe sit under the Fraud and Risk platform and, increasingly, the Radar / Sigma product teams. Financial compliance shapes everything.
What they emphasize. Reliability and idempotency above almost everything. A double-charge in a payment system is a business-ending event. Every API that touches money at Stripe has an idempotency key requirement, and the interview culture extends this to ML systems: if your model makes a decision (decline a transaction, score a risk), that decision must be auditable, reproducible, and retryable without side effects.
Latency is important but secondary to correctness. Stripe’s fraud scoring runs in ~50 ms p99, but the hard constraint is not missing legitimate transactions — the system is expected to approve well over 99.99% of them. False-positive cost is very concrete at Stripe because declined transactions are tracked to revenue.
Leveling. SWE2 (senior), Staff. Two-level structure externally; Staff is the inflection.
Signature question. “Design the feature store that serves Stripe’s real-time fraud model. The model makes a decision within 50 ms, but some features (like ‘has this merchant been involved in chargebacks in the last 30 days’) take 200 ms to compute fresh. How do you handle freshness and staleness?” The interesting part is how you pre-materialize the expensive features, how stale is acceptable, and what the audit trail looks like when a decision is made on a stale feature value.
SWE2 vs Staff delta. SWE2 proposes a pre-materialized feature cache with TTL. Staff identifies that a stale feature that’s wrong-direction (e.g., a merchant cleared of chargebacks but still marked dirty) increases false positives, proposes a write-through invalidation trigger tied to the chargeback resolution event, and adds a compliance field to the decision log that records the feature value and its freshness at decision time.
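A minimal sketch of the Staff-level answer: a feature cache whose read path logs the value and its age at decision time, plus an event-driven invalidation hook for the chargeback-resolution case. Class and field names are illustrative, not Stripe's:

```python
import time

class FeatureCache:
    """Pre-materialized feature cache with freshness-aware decision logging."""

    def __init__(self, max_age_s: float):
        self.max_age_s = max_age_s
        self.store = {}          # key -> (value, written_at)
        self.decision_log = []   # stand-in for the compliance log

    def put(self, key, value, now=None):
        self.store[key] = (value, time.time() if now is None else now)

    def invalidate(self, key):
        # Write-through trigger: drop the entry when the upstream event
        # (e.g. chargeback resolved) makes it wrong-direction stale.
        self.store.pop(key, None)

    def read_for_decision(self, key, now=None):
        now = time.time() if now is None else now
        value, written_at = self.store.get(key, (None, now))
        age = now - written_at
        stale = value is None or age > self.max_age_s
        # Compliance: record the exact value and freshness used.
        self.decision_log.append({"key": key, "value": value,
                                  "age_s": age, "stale": stale})
        return value, stale
```

The point of the log entry is auditability: six months later, a disputed decline can be reconstructed from the feature value and its age, not from whatever the feature store holds today.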
Databricks
Roles. Software Engineer, Senior Software Engineer, Staff Engineer, Solutions Architect. The engineering track is the deepest; SA interviews less so.
What they emphasize. Data plane architecture and the lakehouse pattern. Databricks is Delta Lake / Spark / MLflow at the core. The engineering culture is built around “data infrastructure should be simple enough for a data scientist to operate.” System design interviews will probe your understanding of the Delta transaction log, Z-ordering, file compaction strategies, and how you design a feature store that sits on top of a Delta table without sacrificing write throughput.
For GenAI workloads, the emphasis is on the Mosaic AI platform (formerly MosaicML). Model training, fine-tuning pipelines, and serving via model serving endpoints. Cost-per-training-run and data pipeline throughput are dominant themes.
Leveling. IC4 (senior), IC5 (staff), IC6 (principal). The IC4-to-IC5 promotion bar is high; the company values depth in the data plane over breadth.
Signature question. “Design a multi-tenant fine-tuning platform on top of the Databricks lakehouse where customer data never leaves their workspace, fine-tuned weights are isolated per customer, and the customer can roll back to any prior checkpoint.” The interesting constraints are workspace-level data isolation, checkpoint versioning in Delta, and GPU job scheduling that respects workspace-level quotas.
IC4 vs IC5 delta. IC4 designs the pipeline correctly. IC5 identifies that cross-workspace training data contamination is possible if the base model checkpoint references shared catalog tables, proposes strict Unity Catalog row-level security at the data-loading layer, and notes that the rollback requirement implies append-only checkpoint storage with manifest-based versioning rather than in-place overwrite.
xAI
Roles. ML Infrastructure Engineer, Research Engineer, Software Engineer. The company is small (relative to its ambitions) and moves fast. The interview culture reflects this: less structure, more “can you actually build it.”
What they emphasize. Inference internals and scale. Grok serves a large real-time workload with long context windows (128k tokens as of Grok-1.5). The scaling challenges at xAI are among the most acute in the industry: running models at Colossus-scale (100k H100s) with inter-GPU all-reduce latency as a dominant bottleneck. If you know the 3D-parallelism literature (TP × PP × DP), xAI will give you room to go deep.
Leveling. Flat-ish title structure. Total comp is competitive but equity liquidity is speculative.
Signature question. “At 100k H100s, all-reduce on a 1T-parameter model takes 50 ms per step. What are your options?” They want to hear about ring all-reduce vs tree all-reduce, gradient compression, overlap of compute and communication, ZeRO Stage 3, and the point at which you give up on synchronous all-reduce and switch to asynchronous or local SGD.
Mid vs senior delta. Mid names the techniques. Senior knows the bandwidth math, can estimate when communication starts dominating compute, and can describe the gradient staleness tradeoff in async training.
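The bandwidth math the senior candidate does is mechanical once the assumptions are fixed. The numbers below are invented for illustration (bf16 gradients, a crude 64-way sharding factor standing in for ZeRO-3's reduce-scatter structure, 50 GB/s effective per-GPU links); the 2(N−1)/N factor is the standard ring all-reduce communication cost:

```python
def ring_allreduce_time_s(grad_bytes: float, n_gpus: int,
                          link_bw_bytes_s: float) -> float:
    # Ring all-reduce moves 2*(N-1)/N of the payload over each link.
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / link_bw_bytes_s

# Illustrative only: 1T params in bf16, sharded 64-way, 512-GPU ring,
# 50 GB/s (i.e. 400 Gb/s) effective per-GPU bandwidth.
grad_bytes = 1e12 * 2 / 64          # 31.25 GB per all-reduce group
t = ring_allreduce_time_s(grad_bytes, n_gpus=512, link_bw_bytes_s=50e9)
```

The senior move is comparing `t` against the per-step compute time: once communication is the larger number, the remaining options are overlap, compression, or giving up on synchrony.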
Cohere
Roles. ML Engineer, Research Engineer, Solutions Engineer. Solutions Engineering at Cohere is heavier on systems than at most companies — they serve enterprise customers directly and the SE role involves significant design work.
What they emphasize. Enterprise retrieval systems and fine-tuning for business workloads. Cohere’s product is RAG + embeddings + fine-tuning for the Fortune 500. Design interviews probe knowledge of hybrid retrieval (dense + sparse), reranker architectures, and how you adapt a general-purpose embedding model to a domain via fine-tuning without catastrophic forgetting.
Leveling. Standard SWE ladder. Startup equity.
Signature question. “Design the retrieval pipeline for a customer with 50 million legal documents. They need sub-500ms p99 latency, freshness within 1 hour of a document being added, and the index must support semantic search plus exact keyword match on metadata.” The candidate needs to design the dual-index (dense + BM25), the freshness pipeline (event-driven ingestion vs batch rebuild), and the reranker that merges the two result sets.
Mistral
Roles. Research Engineer, Infrastructure Engineer, ML Engineer. Paris-headquartered but hiring globally. Smaller team than Cohere; emphasis on research-to-production.
What they emphasize. Inference efficiency and open-weights deployment. Mistral pioneered mixture-of-experts at the openly-accessible weight tier (Mixtral). Interview questions probe MoE-specific serving challenges: expert routing, expert-parallel serving, load imbalance across experts, and the activation sparsity problem at low batch sizes.
Signature question. “Mixtral 8×7B has 8 experts per layer, but each token only uses 2. At batch size 1, how does this compare to a dense 7B model on GPU? At batch size 512, what changes?” This is a concrete question about expert activation patterns, memory-bound vs compute-bound serving, and when MoE helps vs hurts.
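One way to reason about the batch-size effect in that question: count how many distinct experts a batch is expected to touch per layer. Under a uniform-routing assumption (real routers are far from uniform, so treat this as a lower-bound intuition), the expectation is E·(1 − (1 − k/E)^B):

```python
def expected_experts_hit(E: int, k: int, batch: int) -> float:
    # Each token activates k of E experts; assuming uniform routing,
    # this is the expected number of distinct experts a batch touches
    # in one layer.
    return E * (1 - (1 - k / E) ** batch)
```

At batch 1 only 2 of 8 expert weight sets stream from HBM per layer, so a memory-bound decode behaves closer to a ~13B dense model than a 47B one; at batch 512 essentially all 8 experts are hit every layer, so the full weight set streams, but amortized over many tokens, shifting the regime toward compute-bound.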
Cursor / Replit (AI coding tools)
Roles. Software Engineer, ML Engineer, Infra Engineer. These are product-focused companies where ML engineering sits close to the product surface. The interview culture skews practical.
What they emphasize. Streaming latency and context management. An AI coding tool needs sub-100ms time-to-first-token because the user is waiting for a completion inline. The prompt construction — which files go in context, in what order, with what truncation strategy — is the real IP. Design interviews often ask about context window management: how do you decide which parts of a large codebase to include in the prompt for a given query, and how do you keep that context fresh as the user edits files.
Signature question (Cursor). “Design the context manager that decides what goes into the LLM prompt for a user who has 50 files open in their editor. The context window is 32k tokens. How do you select, rank, and truncate?” The interesting part is the retrieval problem (which files are relevant?) and the truncation strategy (what do you cut when context is full?).
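A greedy sketch of the selection-and-truncation policy. The ranking signal and the rule of truncating (rather than dropping) the first file that doesn't fit are assumptions for illustration, not Cursor's actual heuristics:

```python
def pack_context(files, budget_tokens: int):
    """Greedy context packing: rank by relevance, take whole files while
    they fit, then truncate the first file that overflows the budget.

    `files` is a list of (path, relevance_score, token_count) tuples.
    """
    ranked = sorted(files, key=lambda f: f[1], reverse=True)
    chosen, used = [], 0
    for path, _, tokens in ranked:
        if used + tokens <= budget_tokens:
            chosen.append((path, tokens))
            used += tokens
        elif budget_tokens - used > 0:
            # Partial inclusion: keep the head of the file (crude policy;
            # a real system would truncate around the relevant spans).
            chosen.append((path, budget_tokens - used))
            used = budget_tokens
            break
    return chosen, used
```

The interesting design discussion starts where this sketch stops: where the relevance score comes from (embeddings, recency of edits, import graph) and whether truncation should preserve file heads, symbol definitions, or retrieval-selected spans.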
Signature question (Replit). “Design the execution sandbox that runs user-generated code from an LLM completion. How do you isolate execution, prevent resource abuse, and stream output back in real time?” This is a systems security problem as much as an ML problem.
Hugging Face
Roles. ML Engineer, Research Engineer, Infrastructure Engineer. Hugging Face interviews for both the Hub platform (storage and serving of hundreds of thousands of models) and Inference Endpoints (the hosted serving product).
What they emphasize. Open-source ecosystem integration and multi-framework serving. HF serves models in PyTorch, JAX, and ONNX, across transformers, diffusers, and custom architectures. The serving challenge is that you cannot know in advance what model a user will upload — the serving layer must handle arbitrary architectures within broad guardrails.
The Hub itself is a fascinating distributed system: 500k+ model repositories, model weight sharding across their LFS-compatible storage layer, and cold-start problems when a suddenly-popular model gets 10k download requests in an hour.
Signature question. “Design the Inference Endpoints product. A customer uploads any HF-compatible model (up to 70B parameters) and within 5 minutes has a live HTTP endpoint they can query. How do you make that 5-minute deployment reliable?” The candidate must cover model conversion/validation, GPU allocation, container orchestration, and the failure mode where the customer’s model has a memory bug that OOMs on first inference.
130.2 Role taxonomies: ML engineer vs ML systems engineer vs applied scientist vs research engineer
These four titles look different on job descriptions and feel different in the interview loop. Candidates who treat them as synonyms go into loops under-prepared or over-prepared.
ML Engineer
The most common title, and the most variable in meaning. At a company like Meta or Google, “ML engineer” typically means someone who trains and deploys production models — not pure research, not pure infra. The day-to-day is split between feature work (improving model quality) and infrastructure work (improving the training and serving stack that supports that quality). The interview loop reflects this split: one or two system design rounds plus one or two ML depth rounds (loss functions, evaluation, training stability).
The most important thing to understand about ML engineer interviews: they are NOT the same as software engineer interviews with ML vocab sprinkled in. Interviewers expect you to know the specific failure modes of model training (loss spikes, gradient norm explosion, silent data quality regressions) at the same depth as you know systems design. A candidate who can design a serving platform but can’t explain why a cosine learning rate schedule outperforms step decay for a given workload will stall at the ML depth rounds.
ML Systems Engineer
This title is more common at infrastructure teams within ML companies. The day-to-day is almost entirely systems: serving runtimes, cluster schedulers, training infrastructure, observability pipelines. The interview loop is closer to a senior SWE loop — two to three system design rounds — with the domain anchored in ML-specific components (GPU scheduling, model deployment pipelines, KV cache management).
The important distinction: ML systems engineers are not expected to have strong opinions about loss functions or model architecture. They are expected to have opinions about batch sizes, memory management, latency budgets, and failure modes in distributed GPU workloads. If you come from a pure software engineering background but have spent time on ML infrastructure, this is your loop.
Applied Scientist
Applied scientist roles sit at the boundary of research and production. The typical day-to-day involves running experiments with real data, publishing internal (sometimes external) findings, and working with ML engineers to deploy models to production. The interview loop has a strong research component — often a deep-dive on a paper you’ve written or a system you’ve built — alongside a system design round and a coding round.
The distinguishing feature of applied scientist interviews: you will be asked about your research process. What was the problem? What did you try that didn’t work? What was the insight that made it work? The interviewer is looking for scientific rigor, not just engineering execution. A candidate who has built production systems but has never run rigorous experiments will struggle in the applied scientist loop even if their engineering skills are excellent.
The system design round in an applied scientist loop is usually lighter than in an ML engineer loop. The interviewer wants to confirm you can think about production constraints, not that you can design a complete serving platform. A solid high-level design with clear tradeoffs is enough; a 45-minute serving internals drill is over-indexed.
Research Engineer
Research engineering is the most specific title. At companies that use it (Anthropic, OpenAI, DeepMind), the role is specifically: make research happen at scale. This means training infrastructure for large experiments, evaluation pipelines, dataset processing at billion-document scale, and tooling that a research scientist can use without a PhD in systems engineering.
The interview loop for research engineer is unusual: it often includes a research paper deep-dive (read a recent company paper, explain the technical contribution, critique it) alongside system design and coding. The interviewer is looking for someone who can operate fluently in both worlds — implement a clean distributed system AND understand why the model is behaving a certain way.
The day-to-day distinction from applied scientist: research engineers rarely run experiments themselves. They build the infrastructure that makes experiments 10× faster for the scientists. The accountability is for infrastructure throughput, not for the quality of the science.
What actually changes in the interview loop
| Loop element | ML Eng | ML Sys Eng | Applied Sci | Research Eng |
|---|---|---|---|---|
| System design rounds | 2 | 2–3 | 1 | 1–2 |
| ML depth rounds | 1–2 | 0–1 | 2 | 1 |
| Research/paper round | 0 | 0 | 1 | 1 |
| Coding rounds | 1–2 | 1–2 | 1 | 1–2 |
| System design focus | serving + training | serving internals | high-level only | training infra |
The coaching implication: if you’re interviewing for ML systems engineer, double down on serving internals and skip the loss-function prep. If you’re interviewing for applied scientist, read your papers carefully and practice the research narrative, and keep the system design prep at a “know the reference architecture well” level.
130.3 Mock transcript 1: “Design an LLM API service for 100k RPS”
Setup. This is a senior ML systems engineer interview at a company like OpenAI or Anthropic. The interviewer is a staff engineer who has shipped LLM serving infrastructure. The candidate is targeting an IC5/L5 role.
Reference architecture the candidate draws:
```mermaid
graph LR
    Client["Client"] --> GW["API Gateway\nauth · rate limit · OTel"]
    GW --> ADM["Admission + Router\nquota · model select"]
    ADM --> FLEET["LLM Serving Fleet\nvLLM · TP=4 · FP8"]
    FLEET <--> KVC["KV Cache Layer\nLMCache + shared prefix"]
    FLEET --> POST["Output Filter\nstreaming-aware scan"]
    POST --> METER["Metering\nKafka at-least-once"]
    METER --> OBS["Observability\nPrometheus · Loki · Tempo"]
    style FLEET fill:var(--fig-accent-soft),stroke:var(--fig-accent)
```
The LLM serving fleet and its KV cache layer together consume 85% of the operating budget — every other box exists to protect and feed them.
INTERVIEWER: Okay, so — the problem I want to work through is: design an LLM API service that can handle 100,000 requests per second. Standard stuff, but I want to see how you think through it.
CANDIDATE: Sure. Before I start drawing anything, I want to make sure I understand the shape of those 100k RPS. Can I ask a few questions?
INTERVIEWER: Go ahead.
CANDIDATE: First, is this 100k API requests — so 100k HTTP calls — or 100k tokens per second? Those are very different capacity problems.
INTERVIEWER: 100k HTTP requests per second. Each is a chat/completions call.
CANDIDATE: Okay. And what’s the token profile? Is this short queries, long queries, variable? Because 100k RPS at 50 output tokens is very different from 100k RPS at 1000 output tokens.
INTERVIEWER: Assume the typical ChatGPT-like distribution — mostly short, some long. Let’s say average 400 output tokens per request.
CANDIDATE: Got it. So at 400 output tokens per request, 100k RPS, that’s 40 million output tokens per second. That’s the number I’m going to design to. Next question: latency target. TTFT? TPOT?
INTERVIEWER: TTFT under 500ms p95. TPOT under 30ms per token.
CANDIDATE: Okay, 30ms TPOT is pretty tight — that’s going to drive our batch size decisions. Last clarifying question: single model or multi-model routing?
INTERVIEWER: You can assume one 70B model for now. We might discuss routing later.
CANDIDATE: Perfect. So let me do the capacity math, and then I’ll draw the architecture. Bear with me, I’m going to talk through this out loud.
100k RPS. 400 output tokens per request. That’s 40M output tokens per second peak. For a 70B model at FP8 on H100s — I’m going to use 1500 output tokens per second per GPU as the realistic decode throughput with continuous batching. That gives me 40M / 1500 = roughly 27,000 H100s for decode alone.
INTERVIEWER: That seems like a lot.
CANDIDATE: It is a lot. Let me sanity check. 100k RPS × 400 tokens = 40M tokens/sec. At 1500 tokens/sec per GPU, that’s 26,667 GPUs. Yeah, we’re looking at north of 25,000 H100s. At current spot pricing around $2/hr, that’s 25k × $2 × 720 hours = roughly $36M per month just for inference compute.
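The candidate’s back-of-envelope math can be sketched in a few lines. All inputs are the transcript’s assumptions (1500 tok/s/GPU decode throughput, $2/hr spot pricing), not measured values:

```python
# Back-of-envelope decode fleet sizing, using the transcript's assumptions.
def decode_fleet_size(rps: int, out_tokens_per_req: int,
                      tokens_per_gpu_per_sec: int) -> int:
    """GPUs needed for decode alone: total token rate / per-GPU rate,
    rounded up. Ignores prefill capacity and burst headroom."""
    total_tokens_per_sec = rps * out_tokens_per_req
    return -(-total_tokens_per_sec // tokens_per_gpu_per_sec)  # ceiling division

gpus = decode_fleet_size(rps=100_000, out_tokens_per_req=400,
                         tokens_per_gpu_per_sec=1500)
monthly_usd = gpus * 2 * 720   # assumed $2/hr spot x 720 hours/month
print(gpus, monthly_usd)       # 26667 GPUs, $38,400,480/month
```

Run against the exact fleet size rather than the rounded 25k, the bill comes out closer to $38M/month; the transcript’s $36M figure uses the rounded number.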
The alternative framing: if someone says “100k RPS” without specifying token count, they often mean 100k short API calls — like a classification endpoint. But for a chat/completions endpoint with 400 output tokens, this is genuinely a massive fleet. I’ll proceed, but wanted to flag that the number is real.
INTERVIEWER: Fair, that’s a realistic check. Okay, let’s proceed. What’s the architecture?
CANDIDATE: Okay so, the architecture — let me draw the main path. Incoming requests hit an API gateway. I’d use Envoy AI Gateway here, OpenAI-compatible front door, JWT auth, rate limiting with a token bucket per client. From there, requests go to an admission and routing layer. The admission layer enforces per-tenant quotas and does load shedding under pressure — I’ll come back to that. Then requests go to the LLM serving fleet. I’ll put a KV cache layer alongside it. The output goes through a streaming-aware safety filter, then to a metering sidecar, and all of this is feeding into an observability pipeline.
The serving fleet is the money. Let me be specific: vLLM, tensor parallelism 4 for the 70B model, FP8 quantization. Why TP=4 instead of TP=2? At TPOT 30ms, I need decode steps to be fast. TP=4 on H100s with 70B FP8 should get per-step latency around 25-28ms, which gives me margin. TP=2 would push closer to 40ms, which violates the SLO.
INTERVIEWER: Hmm. That’s actually a good point, but I want to push on something. At TP=4, you’re using 4 GPUs per request to serve one request at a time, right? Doesn’t that tank your throughput?
CANDIDATE: Good pushback. No — TP=4 means 4 GPUs share the model weights, not that each request uses 4 GPUs exclusively. With continuous batching, you’re running many requests in parallel in the same forward pass. The per-step latency comes from the model architecture and the all-reduce latency, not from the number of in-flight requests. So at TP=4, I might have 64 requests batched together in one forward pass, and all 64 benefit from the 4-GPU compute. The throughput at TP=4 is slightly lower than TP=2 because all-reduce overhead grows, but the per-step latency is better, which is what the 30ms TPOT SLO requires.
INTERVIEWER: Right, okay, that tracks. So back to the design — 27,000 GPUs. How do you manage a fleet that size?
CANDIDATE: The honest answer is you can’t manage it as a single fleet. At that scale you need cell-based architecture. I’d partition into cells of maybe 1,000-2,000 GPUs each, spread across availability zones, with each cell acting as a fault domain. The admission layer does cell-aware routing — it picks a healthy cell, routes the request in, and if a cell is degraded, it routes around it. A cell going down doesn’t take down the whole fleet.
Within each cell, you have a pool of vLLM replicas. Each replica is a 4-H100 unit running one model shard. The number of replicas in a cell scales with demand — KEDA autoscaling on vllm:num_requests_running, target 48 requests per replica, with autoscale-up threshold at 70% to leave room for burst.
INTERVIEWER: Okay. What about the KV cache — at 100k RPS, what does your KV cache strategy look like?
CANDIDATE: This is the interesting part. At 100k RPS with a shared system prompt — and almost every API deployment has one — prefix caching is essential. The system prompt is, say, 1500 tokens. Without prefix caching, every request pays the prefill cost for those 1500 tokens. With prefix caching, each vLLM replica computes those tokens once and caches the KV, so subsequent requests start from after the prefix.
But at this scale, I’d add cross-replica prefix caching via LMCache. When a request lands on a replica that hasn’t seen this prefix yet, instead of recomputing it, the replica fetches the KV cache from LMCache — which is basically a distributed KV store backed by Redis. The cache key is the hash of the prompt prefix up to the shared system prompt boundary. A hit costs the network transfer of the KV payload, call it single-digit to low tens of milliseconds, which is still much cheaper than 300ms of prefill compute.
INTERVIEWER: What if LMCache itself becomes a bottleneck at this scale?
CANDIDATE: Yeah, that’s the failure mode I’d watch for. At 100k RPS, even if only 10% of requests are cache misses that need to go to LMCache, that’s 10k RPS hitting Redis. A single Redis node tops out at maybe 100k simple GETs per second, but these are big payloads — the KV cache for 1500 tokens of a 70B FP8 model is roughly 1500 tokens × 80 layers × 2 (K and V) × the per-token KV width, which with grouped-query attention works out to a couple hundred megabytes per prefix. So the bottleneck is bandwidth, not ops.
The mitigation is: shard LMCache by prefix hash, so each shard handles a subset of system prompts. And maintain a local in-memory prefix cache in each vLLM replica as the L1. LMCache is L2. L1 miss falls back to L2. L2 miss falls back to compute. This hierarchy matches the cost of each level.
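The L1 → L2 → compute hierarchy can be sketched roughly like this; the class and the plain-dict L2 are illustrative stand-ins, not LMCache’s actual API:

```python
import hashlib

class PrefixKVCache:
    """Two-level prefix KV lookup: local L1 dict, shared L2 store,
    recompute on double miss. The dict-like L2 stands in for a sharded
    Redis/LMCache deployment."""

    def __init__(self, l2_store, compute_prefill):
        self.l1 = {}                          # per-replica in-memory cache
        self.l2 = l2_store                    # shared cross-replica store
        self.compute_prefill = compute_prefill

    @staticmethod
    def key(prefix_tokens: tuple) -> str:
        # Hash of the prompt prefix up to the shared system-prompt boundary.
        return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()

    def get(self, prefix_tokens: tuple):
        k = self.key(prefix_tokens)
        if k in self.l1:                      # L1 hit: essentially free
            return self.l1[k]
        if k in self.l2:                      # L2 hit: pay the transfer cost
            self.l1[k] = self.l2[k]
            return self.l1[k]
        kv = self.compute_prefill(prefix_tokens)  # double miss: pay prefill
        self.l1[k] = kv
        self.l2[k] = kv
        return kv
```

Two replicas sharing one L2 store will each prefill a given prefix at most once between them, which is the whole point of the cross-replica layer.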
INTERVIEWER: Good. What happens when a cell is overloaded? Walk me through the failure mode.
CANDIDATE: Right — so the scenario is: a cell is at 90% capacity. The admission layer at the cell boundary sees vllm:gpu_cache_usage_perc creeping above 85%.
First, the cell starts shedding free-tier traffic. Paid-tier traffic still gets admitted, but free-tier gets a 429. The 429 includes a Retry-After header so the client backs off exponentially — I want back-pressure, not failure.
If the cell hits 95% KV cache utilization despite shedding free-tier, vLLM starts preempting low-priority in-flight requests. These get re-queued with a penalty and re-submitted. The user experience is a pause in streaming, not an error.
If the cell is genuinely OOM — all that fails — the cell reports unhealthy to the global admission layer. The admission layer reroutes new requests to other cells. In the worst case, where all cells are degraded, the system does graceful degradation: only premium-tier requests get served, others queue with an appropriate wait-time estimate.
The key metric I’d watch in production: vllm:num_preemptions_total. A rising preemption count is the early warning before the cache actually exhausts. I’d set an alert at 50 preemptions per minute per replica — that means the scheduler is under real pressure and autoscale should be firing.
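The shedding ladder the candidate walks through can be sketched as a single admission decision; the thresholds come from the transcript, while the tier names and return shape are illustrative:

```python
def admit(tier: str, kv_cache_usage: float) -> dict:
    """Tiered load shedding: free traffic sheds first, then everything
    below premium. 429 responses carry a Retry-After hint so clients
    back off instead of hammering the cell."""
    if kv_cache_usage >= 0.95:
        # Near KV-cache exhaustion: only premium traffic is admitted.
        if tier != "premium":
            return {"status": 429, "retry_after_s": 30}
    elif kv_cache_usage >= 0.85:
        # Early pressure: shed free tier only.
        if tier == "free":
            return {"status": 429, "retry_after_s": 10}
    return {"status": 200}
```

Keeping the decision a pure function of (tier, utilization) makes it trivial to unit-test the exact thresholds that will page someone at 3 a.m.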
INTERVIEWER: Last one. How do you roll out a new model version to this fleet?
CANDIDATE: Canary with shadow eval. I would not touch live traffic until the golden set passes. The rollout sequence:
First, deploy the new version to one replica in one cell — no traffic. Run a synthetic load test and the golden-set eval offline. If golden set passes (score within 2% of baseline), proceed.
Second, shift 1% of production traffic to the new replica. Watch TTFT p95, TPOT p95, error rate, and the LLM-judge quality score for 15 minutes. Auto-rollback triggers: if any SLO breaches, or if quality drops more than 3% relative. If all clear, go to 5%, then 25%, then 100%, with bake periods at each step.
The tricky part at 27k GPU scale: a 1% canary is still 270 GPUs. That’s a meaningful amount of traffic to put at risk. So the canary itself is staged — start with one cell before expanding to multiple cells. KServe’s traffic split supports this natively.
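The auto-rollback gate can be sketched as a pure function over the canary’s metrics. The SLO keys and the 3% relative quality threshold mirror the transcript; everything else (dict shapes, names) is illustrative:

```python
def canary_gate(baseline: dict, candidate: dict, slo: dict) -> str:
    """Decide whether a canary stage may proceed. Any SLO breach or a
    >3% relative quality drop triggers rollback."""
    if candidate["ttft_p95_ms"] > slo["ttft_p95_ms"]:
        return "rollback"
    if candidate["tpot_p95_ms"] > slo["tpot_p95_ms"]:
        return "rollback"
    if candidate["error_rate"] > slo["error_rate"]:
        return "rollback"
    quality_drop = (baseline["quality"] - candidate["quality"]) / baseline["quality"]
    if quality_drop > 0.03:
        return "rollback"
    return "proceed"
```

The staged 1% → 5% → 25% → 100% sequence is then just this gate evaluated at the end of each bake period.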
INTERVIEWER: Good. I think we’re in good shape on time. Any questions?
CANDIDATE: Yeah — one thing I’d want to explore if we had more time: the metering pipeline. At 100k RPS, that’s 100k Kafka events per second. I’d want to confirm the partition count and consumer lag SLO before signing off on the design.
INTERVIEWER: Good instinct. That’s all from me.
Interviewer notes — what this candidate did well:
- Immediately clarified the token-count distinction before doing any math. This is the difference between a real 100k RPS number and a meaningless one.
- Did the capacity math out loud and flagged when the result was surprising (27k H100s), which showed calibration, not just computation.
- TP=4 rationale was sharp — correctly explained the difference between TP degree and per-request GPU count.
- Cell-based architecture came up naturally, not as a buzzword.
- The KV cache hierarchy (L1 in-memory → L2 LMCache → compute) is a real design pattern and was described at the right level of detail.
- Failure mode walkthrough hit the four bases: shed load, preempt low-priority, reroute, graceful degrade.
- Rollout section was solid. Golden set gate, canary at multiple stages, auto-rollback triggers.
What was missing or under-developed:
- No explicit ops phase — the candidate was asked to cover it but never volunteered an SLO structure or error budget.
- The metering observation at the end was correct but came too late. Should have mentioned Kafka partition sizing earlier in the design.
- Speculative decoding wasn’t mentioned. At TPOT 30ms with a chat workload, it’s a relevant optimization.
- The prefill math was glossed over. At 100k RPS, the prefill capacity is comparable to decode capacity — it deserved a line.
Verdict: Strong IC5 performance. Gets the offer. Probably doesn’t get calibrated as IC6 without more depth in ops and prefill disaggregation.
130.4 Mock transcript 2: “Design a RAG system for 100M enterprise documents”
Setup. This is a senior ML engineer interview at a company like Cohere or Databricks. The interviewer is a staff engineer on the retrieval platform.
Reference architecture the candidate draws:
graph LR
Client["Client Query"] --> GW["API Gateway"]
GW --> QP["Query Processor\nencode · expand · rewrite"]
QP --> DENSE["Dense ANN Index\nHNSW / DiskANN"]
QP --> SPARSE["BM25 Index\nElasticsearch"]
DENSE --> MERGE["Reciprocal Rank Fusion\nor learned merger"]
SPARSE --> MERGE
MERGE --> RR["Cross-Encoder Reranker\nCohere Rerank / bge-reranker"]
RR --> CTX["Context Builder\ntruncate · cite · assemble"]
CTX --> LLM["LLM Generation\nvLLM"]
LLM --> OUT["Response + citations"]
style RR fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The reranker is the accuracy lever — it’s also the latency bottleneck; the key design constraint is fitting it in the budget while keeping p99 under 2 seconds.
INTERVIEWER: Alright. Let’s do a RAG system. 100 million enterprise documents. Design the whole thing.
CANDIDATE: Okay — so I want to start with what “enterprise documents” means. Can I ask a few things?
INTERVIEWER: Sure.
CANDIDATE: What’s the document type distribution? Are we talking mostly PDFs and Word docs, or also code, spreadsheets, emails? Because parsing is a real engineering problem and it affects chunk size.
INTERVIEWER: Mostly PDFs and Word docs. Assume they come in already parsed as text.
CANDIDATE: Good, that simplifies things. Second: what’s the freshness requirement? How quickly after a document is added does it need to be queryable?
INTERVIEWER: One hour SLA.
CANDIDATE: Okay, one hour — that rules out weekly batch rebuilds, means we need an incremental ingestion pipeline. Third: is this multi-tenant? One enterprise’s documents, or many?
INTERVIEWER: Multiple tenants. 500 enterprise customers, average 200,000 documents each. Total 100 million. Strict document isolation — tenant A cannot see tenant B’s documents.
CANDIDATE: That’s the most important constraint you’ve given me. Multi-tenant with strict isolation means the index architecture is different. I’ll factor that in. Last one: latency target?
INTERVIEWER: End-to-end query response under 2 seconds p99.
CANDIDATE: 2 seconds total including generation, or just retrieval?
INTERVIEWER: Total. 2 seconds from query to complete response.
CANDIDATE: Okay — that’s tight if you want a cross-encoder reranker plus LLM generation. I’ll flag the budget breakdown when I get to the design.
Let me do a quick capacity estimate. 100M documents. Average document is, say, 3000 words — about 5 pages. I’m going to chunk at 512 tokens with 50-token overlap. 100M documents × roughly 6 chunks each = 600 million chunks. Each chunk has a vector embedding — let’s say 768-dimensional float32, that’s 3KB per chunk. 600M × 3KB = 1.8 TB of vectors. That’s the storage floor for the dense index. Plus the BM25 inverted index — compressed postings typically land at a fraction of the raw text size, and with roughly 1.5-2 TB of raw text here, call it 300-400 GB.
The active query volume: if 500 enterprises have 1000 users each with 10 queries per day, that’s 5M queries per day, ~58 QPS average. Peak at maybe 3×: ~175 QPS. That’s tractable.
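The storage arithmetic, as a sketch (all figures are the transcript’s assumptions):

```python
def rag_storage_estimate(docs: int, chunks_per_doc: int, dim: int,
                         bytes_per_float: int = 4):
    """Chunk count and raw vector bytes for the dense index. Excludes
    the ANN graph overhead (HNSW links typically add a large fraction
    on top of the raw vectors)."""
    chunks = docs * chunks_per_doc
    vector_bytes = chunks * dim * bytes_per_float
    return chunks, vector_bytes

chunks, vec_bytes = rag_storage_estimate(100_000_000, 6, 768)
print(chunks, vec_bytes / 1e12)   # 600M chunks, ~1.84 TB of float32 vectors
```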
INTERVIEWER: Walk me through the architecture.
CANDIDATE: Okay so there’s an ingestion path and a query path. Let me do query path first since it’s the user-facing piece.
Query comes in. Goes to the API gateway — standard auth, tenant ID extraction, rate limiting. Then to a query processor. This is where I do query encoding: produce a dense embedding of the query using the same encoder the documents were embedded with. I might also do query expansion — using an LLM to generate 2-3 alternative phrasings of the query and running each through retrieval. Query expansion adds latency but significantly improves recall on enterprise queries that use domain-specific jargon.
From the query processor, I fan out to two indexes in parallel. Dense ANN index and BM25 sparse index. The dense index is for semantic similarity; BM25 is for exact keyword match. Enterprise users often know exact terms they’re looking for — contract numbers, product codes — and BM25 catches those where dense search would miss.
The results from both indexes go to a reciprocal rank fusion step. RRF is a simple, effective way to merge two ranked lists without needing to know the score scales. Score the candidate from the dense index as 1/(k + rank_dense), from sparse as 1/(k + rank_sparse), sum them.
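RRF itself is only a few lines. This sketch uses the commonly cited k=60 constant, which the transcript leaves unspecified:

```python
def reciprocal_rank_fusion(ranked_lists, k: int = 60):
    """Merge ranked candidate lists with reciprocal rank fusion.
    Each list contributes 1/(k + rank) per document; no score
    normalization across retrievers is needed."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]     # dense ANN results, best first
sparse = ["d1", "d9", "d3"]    # BM25 results, best first
print(reciprocal_rank_fusion([dense, sparse]))  # ['d1', 'd3', 'd9', 'd7']
```

Note how d1 wins despite never ranking first in the dense list: appearing near the top of both lists beats topping one.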
The merged top-N — let’s say top 50 — go to a cross-encoder reranker. The reranker sees (query, chunk) pairs and produces a relevance score that’s much higher quality than the first-pass retriever. I’m using a cross-encoder here (Cohere Rerank or bge-reranker-v2) rather than a bi-encoder because cross-encoders can attend to the query and the chunk together, which is what you need for nuanced relevance. The cost: it’s slower, roughly 100ms for a batch of 50 pairs with a small reranker model. But we’re still inside the 2-second budget.
The reranked top-10 go to a context builder. This assembles the LLM prompt: system instructions + retrieved chunks (in relevance order) + the query. It also handles truncation if the context window would overflow — typically drop the lowest-ranked chunks first.
Then LLM generation. For a 2-second total budget with reranker taking 100ms and retrieval taking 150ms, I have about 1.5 seconds for generation. At 30ms TPOT, that’s 50 output tokens. For a QA answer with citations, 50 tokens is fine. If the product needs longer answers, I need to either cut the reranker budget or stream the response and set the 2-second bar as time-to-first-token.
INTERVIEWER: Right, so you mentioned tenant isolation. 500 tenants, strict isolation. How?
CANDIDATE: So there are two approaches and I want to lay them out explicitly because they have very different tradeoffs.
Option A: one shared index, tenant-scoped filters. All 600M chunks go into one ANN index. Each chunk has a tenant_id metadata field. At query time, the retriever applies a tenant_id = X filter before returning results. HNSW with metadata filtering is supported by most vector stores (Weaviate, Qdrant). The advantage is infrastructure simplicity and better hardware utilization. The disadvantage is noisy-neighbor risk: a tenant with heavy write traffic can block reads for others, and filter performance degrades as the index grows.
Option B: per-tenant index. Each tenant gets their own vector index. 500 indexes, average 1.2M chunks each. The advantage is hard isolation — tenant A’s write load literally cannot affect tenant B. The disadvantage is that 500 indexes × 1.8 TB / 500 … wait, let me recalculate. Each tenant has 200k documents × 6 chunks × 3KB = 3.6 GB of vectors. 500 tenants × 3.6 GB = 1.8 TB total. That’s the same total, but now I have 500 small indexes rather than one large one. Small HNSW indexes are fast and cheap; this might actually be the better choice.
For 500 tenants at 200k docs each, I’d go with Option B. Per-tenant indexes, stored in a shared object store (S3), loaded into memory on a pool of vector store replicas. A routing layer maps tenant_id to the replicas holding that tenant’s index. When a query for tenant X comes in, it goes to the replica(s) holding tenant X’s index.
INTERVIEWER: What about index freshness — you said 1-hour SLA. Walk me through the ingestion path.
CANDIDATE: So the ingestion path is event-driven, not batch. A new document arrives (uploaded to S3 or pushed via API), triggers an event to an ingestion queue — Kafka. An ingestion worker picks up the event, parses and chunks the document, calls the embedding model to produce vectors, and then upserts the vectors into the tenant’s index.
The 1-hour SLA means the end-to-end latency from document upload to queryable in the index must be under 1 hour. That’s not tight — the whole pipeline can run in minutes for a single document. The challenge is backpressure at scale: if a tenant uploads 100,000 documents at once (common for onboarding), the ingestion worker pool needs to handle it without pushing other tenants’ updates beyond the SLA.
I’d handle this with per-tenant ingestion quota. Each tenant gets a max throughput of X documents per minute through the ingestion queue. New uploads beyond that get buffered in S3 with a scheduled processing queue. The scheduled queue ensures they’re processed within the 1-hour window even if the real-time path is saturated.
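The per-tenant quota can be sketched as a token bucket that routes overflow to the scheduled queue; the class and queue names are illustrative, not a specific product’s API:

```python
import time

class TenantIngestQuota:
    """Token bucket per tenant: documents within the rate go to the
    real-time ingestion queue; overflow is diverted to a scheduled
    queue that still meets the 1-hour SLA."""

    def __init__(self, docs_per_minute: int):
        self.rate = docs_per_minute / 60.0       # tokens refilled per second
        self.capacity = float(docs_per_minute)
        self.tokens = self.capacity
        self.last = time.monotonic()

    def route(self, now: float = None) -> str:
        now = time.monotonic() if now is None else now
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return "realtime"
        return "scheduled"   # buffered in S3, drained by the scheduled path
```

A tenant dumping 100k documents at once burns through its bucket, and everything past the quota quietly takes the scheduled path without starving other tenants.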
The embedding model itself — I’d run a dedicated embedding service (Text Embedding Inference or FastEmbed) separate from the LLM serving fleet. Embedding models are cheap and high-throughput; they don’t need H100s. A pool of A10G or L4 GPUs handles this economically.
INTERVIEWER: What’s the failure mode you’re most worried about?
CANDIDATE: The reranker. Here’s why: the reranker is on the critical path with a fixed latency budget, it’s a neural model so it can return garbage silently if it regresses, and it’s the component most likely to be the latency bottleneck when query complexity spikes.
The failure modes in order:
One: reranker latency spike. Cross-encoder rerankers have a quadratic scaling in sequence length — a query with a very long chunk can push batch latency from 100ms to 800ms. Mitigation: chunk size cap (512 tokens max), query length cap, timeout on the reranker call with fallback to the first-pass RRF ranking.
Two: reranker model regression. A new reranker version could be worse on enterprise-domain queries even if the general benchmark looks fine. This is the golden-set problem. I’d maintain a 5,000-pair annotated eval set (query, chunk, relevance label) per major vertical (legal, finance, HR) and gate every reranker update on 95% recall@10 on that set.
Three: ANN index staleness exceeding the SLA. If the ingestion pipeline backs up, newly uploaded documents are invisible until they’re indexed. I’d add an alert on ingestion queue lag — if a document has been in the queue for more than 45 minutes without processing, page the on-call team.
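The timeout-with-fallback from the first mitigation can be sketched like this; the 300ms default deadline is an assumption, and `rerank_fn` is a placeholder for the actual cross-encoder call:

```python
from concurrent.futures import ThreadPoolExecutor

def rerank_with_fallback(query, candidates, rerank_fn, timeout_s: float = 0.3):
    """Run the cross-encoder with a hard deadline. On timeout or model
    error, serve the first-pass (RRF) order instead of failing the
    request. Returns (ranking, mode) so callers can log degradations."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(rerank_fn, query, candidates)
    try:
        return future.result(timeout=timeout_s), "reranked"
    except Exception:            # deadline exceeded or reranker crashed
        return list(candidates), "rrf_fallback"
    finally:
        pool.shutdown(wait=False)
```

Counting the `rrf_fallback` mode in metrics gives you the exact trigger-rate signal the interviewer notes later say was missing.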
INTERVIEWER: Good. One last thing — how do you handle a query where the answer spans multiple documents? The model needs context from three separate docs.
CANDIDATE: Yeah, this is the multi-document context window problem. My context builder gets back the top-10 reranked chunks, and some of those chunks might be from different documents — three from doc A, two from doc B, and so on. For a 2-second budget with 50 output tokens, this is already the design: include all top-10 in the prompt, in relevance order, with citation metadata so the LLM can say “based on [document B, section 3]…”
Where it gets hard is when the answer actually requires synthesizing across documents that don’t appear close in the retrieval ranking. Like, document X has the definition and document Y has the exception, and you only find document Y if you know to look for it. That’s a multi-hop retrieval problem. For a first version, I’d skip it — single-pass retrieval covers 80%+ of queries. If it becomes a product need, the approach is iterative retrieval: run generation step 1, extract the sub-questions implied by the partial answer, run retrieval again on those sub-questions, continue until the answer converges. HyDE, which generates a hypothetical answer and retrieves against its embedding, is a simpler, single-pass relative of this idea.
INTERVIEWER: Alright. That was solid. Any questions from you?
CANDIDATE: Yeah — one thing I’d want to design more carefully is the metadata filter architecture for when a tenant does want to filter by date or author within their own documents. I glossed over that, but it’s a real product need.
Interviewer notes — what this candidate did well:
- The multi-tenant isolation question was handled cleanly — laid out both options, quantified the storage cost for each, chose the right one for the parameters given.
- Query expansion was mentioned early, which shows real-world RAG experience (most candidates don’t bring it up).
- Latency budget breakdown was sharp: allocated 150ms retrieval, 100ms reranker, 1.5s generation — shows the candidate is thinking in terms of a real latency budget, not an abstract diagram.
- Failure modes for the reranker were specific and realistic.
- The multi-hop retrieval question was handled honestly — “skip it for v1, here’s the approach for v2” is the right senior answer.
What was missing or under-developed:
- No explicit monitoring section. What metrics does this system expose? What pages on-call?
- The embedding model fleet was mentioned but not sized. How many embedding replicas? What’s the throughput?
- The fallback from reranker to RRF was mentioned but not specified — what’s the exact trigger condition? Timeout? Error rate?
Verdict: Strong IC5. Would likely require one more depth conversation on the ops side to calibrate IC6.
130.5 Mock transcript 3: “Design a multi-tenant fine-tuning service”
Setup. This is a senior ML systems engineer interview at Databricks or AWS SageMaker. The interviewer is a principal engineer who has shipped fine-tuning infrastructure. This is a harder problem than the previous two because it combines ML training, data isolation, and serving in one system.
Reference architecture the candidate draws:
graph TD
Upload["Customer Data Upload\nS3 with customer KMS key"] --> VAL["Data Validator\nschema · PII scan · dedup"]
VAL --> JQ["Job Queue\nKafka · priority tiers"]
JQ --> SCHED["GPU Scheduler\nper-tenant quota · bin-pack · preempt"]
SCHED --> TRAIN["Training Fleet\nFSDP · DeepSpeed"]
TRAIN --> CKP["Checkpoint Store\nDelta Lake / S3 versioned"]
CKP --> REG["Model Registry\nMLflow · per-tenant namespace"]
REG --> DEPLOY["Deployment Trigger\nauto or manual"]
DEPLOY --> SERVE["Inference Endpoint\nvLLM per-tenant"]
style SCHED fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The GPU scheduler is the resource-allocation center of the whole system — a bad scheduling decision means one tenant’s job starves another’s indefinitely.
INTERVIEWER: Alright. Let’s do something a bit harder. Design a multi-tenant fine-tuning service. Enterprise customers upload their own data, we fine-tune their model, they get back an endpoint they can query.
CANDIDATE: Okay. This is a full-stack ML systems problem — data intake, training, checkpointing, serving. Let me clarify a few things first.
When you say “fine-tune,” what’s the scope? Full fine-tune, LoRA-style PEFT, or both?
INTERVIEWER: Both. Most customers use LoRA because it’s cheaper. A few want full fine-tune.
CANDIDATE: Okay. Base model: are we fine-tuning from one shared base model, or can customers bring their own base?
INTERVIEWER: One base model for now — a 7B model. Later maybe multiple, but design for one.
CANDIDATE: How many concurrent customers are we talking? 500 enterprise customers like the last problem?
INTERVIEWER: Let’s say 200 customers, each with roughly one to three active fine-tuning jobs at a time. Some customers have large datasets — say 1 million examples. Some have small ones, 10,000 examples.
CANDIDATE: And for the resulting model — do customers get a dedicated endpoint, or does their fine-tuned LoRA adapter get loaded into a shared serving fleet?
INTERVIEWER: Both options should be available. Dedicated endpoint for customers who need performance guarantees. Shared pool with adapter-switching for customers who don’t.
CANDIDATE: Good. Last one: data isolation. Customer A’s data cannot touch Customer B’s model?
INTERVIEWER: Strict. Customer A cannot even see that Customer B exists.
CANDIDATE: Okay. That shapes the architecture significantly. Let me estimate, then design.
200 customers, say 1.5 average active jobs per customer — 300 concurrent fine-tuning jobs. A LoRA job on a 7B model training on 100k examples for 2 epochs takes about 45 minutes on one H100 at batch size 16. A full fine-tune on 1M examples takes maybe 8 hours on 8 H100s. The GPU requirements are highly variable by job.
For scheduling purposes, I’ll assume roughly 300 concurrent jobs, some small (1–2 H100s), some large (8 H100s). Mean GPU requirement per job maybe 4 H100s. So 300 × 4 = ~1200 H100s total in the training fleet, plus headroom for burst. Call it 1500 H100s in the training cluster.
Architecture: I see three main subsystems. Ingestion and validation. Training and scheduling. Registry and serving. Let me walk through each.
Ingestion and validation. Customers upload data to an S3 prefix scoped to their tenant ID, with their own KMS encryption key. This is important for the data isolation guarantee — the training workers can only access data in the customer’s prefix via IAM role assumption. A data validator runs on upload: check format (JSONL with prompt/completion fields), scan for PII that the customer might not want included, deduplicate examples. Validation errors are surfaced asynchronously via webhook or status API.
After validation, a training job is submitted to the job queue (Kafka topic). The job spec includes: tenant ID, dataset location, base model version, hyperparameters, and priority tier.
INTERVIEWER: You said IAM role assumption for data isolation. What does that look like in practice?
CANDIDATE: Each customer has a customer-specific IAM role in their AWS account — or, if this is a single-account multi-tenant design, a role in our account that has a trust policy limited to their prefix. When a training job fires up for customer X, the training container assumes role arn:aws:iam::ACCOUNT:role/customer-X-training, which has GetObject/ListBucket permissions only on s3://our-bucket/tenant-X/. It cannot read s3://our-bucket/tenant-Y/. This is enforced by IAM, not by application code, which is the right place for a security boundary.
INTERVIEWER: Good. Go on.
CANDIDATE: Training and scheduling. The job queue feeds into a GPU scheduler. This is the hardest part of the design. I have 1500 H100s, 300 concurrent jobs, each job needing between 1 and 8 GPUs. The scheduling problem is: allocate GPUs to jobs subject to per-tenant quotas, maximize utilization, and don’t let one large job starve small jobs indefinitely.
I’d use a fair-share scheduler with per-tenant quotas. Each tenant has a max concurrency quota — say 4 H100s at the LoRA tier, 16 H100s at the full fine-tune tier. Jobs that exceed the quota queue rather than getting admitted. Within quota, jobs are bin-packed onto available GPUs to maximize utilization.
Preemption: if a high-priority job arrives (e.g., a customer paying for expedited SLA), and no GPUs are free, the scheduler can preempt a lower-priority job. Preempted jobs checkpoint their state before being killed. For this to work well, checkpointing must be frequent enough that preemption doesn’t lose significant progress — I’d checkpoint every 500 steps.
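The quota-then-preempt admission decision can be sketched as follows. Real schedulers (Slurm, Kueue, and the like) carry far more state; the class here is purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class Job:
    tenant: str
    gpus: int
    priority: int = 0   # higher = more urgent

class FairShareScheduler:
    """Minimal sketch: per-tenant quota check, then capacity check with
    single-victim preemption of a lower-priority running job."""

    def __init__(self, total_gpus: int, tenant_quota: dict):
        self.free = total_gpus
        self.quota = tenant_quota   # tenant -> max concurrent GPUs
        self.used = {}              # tenant -> GPUs in use
        self.running = []

    def try_admit(self, job: Job) -> str:
        tenant_used = self.used.get(job.tenant, 0)
        if tenant_used + job.gpus > self.quota.get(job.tenant, 0):
            return "queued"         # over per-tenant quota: wait
        if job.gpus > self.free:
            victims = [j for j in self.running if j.priority < job.priority]
            victim = min(victims, key=lambda j: j.priority, default=None)
            if victim is None or victim.gpus + self.free < job.gpus:
                return "queued"     # nothing preemptible frees enough GPUs
            self._release(victim)   # checkpoint + kill victim, re-queue it
        self.free -= job.gpus
        self.used[job.tenant] = self.used.get(job.tenant, 0) + job.gpus
        self.running.append(job)
        return "running"

    def _release(self, job: Job):
        self.running.remove(job)
        self.free += job.gpus
        self.used[job.tenant] -= job.gpus
```

The 500-step checkpoint cadence is what makes the `_release` path cheap: a preempted job loses at most a few minutes of progress.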
INTERVIEWER: What if a customer submits a job that has a bug — say, their training loop has an off-by-one error that causes the loss to diverge and the job runs for 24 hours instead of 45 minutes?
CANDIDATE: This is a real ops problem. A stuck job that consumes quota and doesn’t make progress degrades everyone’s throughput. I’d handle it with watchdog monitoring on training health signals.
During training, the training worker emits metrics every N steps: training loss, gradient norm, learning rate. A health checker watches these. If training loss stops decreasing for 100 consecutive steps — that’s usually 15-30 minutes of wall time for small jobs — it fires a “possible divergence” alert. If gradient norm explodes (> 100 for a well-scaled model), that’s an immediate hard signal.
When a health alert fires, the system first notifies the customer via webhook. If the customer doesn’t intervene within 10 minutes, the job is paused and marked “needs review.” The customer can inspect the training logs, fix their data, and resubmit. The quota is released on pause.
I’d also set a wall-clock job timeout — say 48 hours maximum for any job. After 48 hours, the job is killed and the customer is notified. This prevents zombie jobs from holding quota indefinitely.
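The watchdog logic is simple enough to sketch directly. The 100-step patience and the gradient-norm hard limit come from the transcript; the class name is illustrative:

```python
class TrainingWatchdog:
    """Flags divergence: loss not improving for `patience` consecutive
    steps, or an exploding gradient norm (immediate hard signal)."""

    def __init__(self, patience: int = 100, grad_norm_limit: float = 100.0):
        self.patience = patience
        self.grad_norm_limit = grad_norm_limit
        self.best_loss = float("inf")
        self.steps_since_improve = 0

    def observe(self, loss: float, grad_norm: float) -> str:
        if grad_norm > self.grad_norm_limit:
            return "alert:grad_explosion"
        if loss < self.best_loss:
            self.best_loss = loss
            self.steps_since_improve = 0
        else:
            self.steps_since_improve += 1
            if self.steps_since_improve >= self.patience:
                return "alert:no_progress"
        return "ok"
```

The health checker feeds each emitted (loss, grad_norm) pair into `observe` and escalates to the customer webhook on the first alert.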
INTERVIEWER: Hmm. What about checkpointing and rollback? You said the customer should be able to roll back to any prior checkpoint.
CANDIDATE: Yes. Checkpoints are stored in an append-only store — Delta Lake tables on S3 with versioned checkpoint paths, or just S3 with versioned objects. Every N steps, the training worker writes a checkpoint: model weights (or LoRA adapter weights), optimizer state, step count, training config.
The checkpoint manifest is stored separately from the weights, in MLflow or a similar model registry scoped to the tenant. Each checkpoint is an immutable artifact tagged with step number and timestamp. The customer can query “show me all checkpoints for job X” and get back a list of step → S3 path → eval metric (if we ran eval at checkpoint time).
Rollback means: take checkpoint K and deploy it instead of the current checkpoint. The deployment layer (I’ll get to this) treats a rollback as just another deployment from a different source checkpoint. Because the checkpoint store is append-only, rolling back doesn’t delete the newer checkpoint — the customer can re-deploy the newer version if they decide the rollback was wrong.
The one tricky case: if a customer deletes their account, the retention policy has to apply to checkpoints. Enterprise customers often have a 7-year retention requirement — that checkpoint data lives in their S3 bucket under their KMS key, and they control the retention. Our system just stores the pointers.
INTERVIEWER: Okay, and the serving side. LoRA adapter switching versus dedicated endpoint?
CANDIDATE: Two deployment modes.
Mode 1: dedicated endpoint. The customer’s LoRA adapter (or fully fine-tuned model) gets a dedicated vLLM instance. The instance has the base model loaded plus the LoRA adapter merged or loaded dynamically. This gives the customer guaranteed throughput and latency. It’s expensive for customers who have low traffic.
Mode 2: shared adapter pool. A pool of vLLM instances runs the base 7B model. When a request comes in for tenant X, the serving layer routes it to an instance that has tenant X’s LoRA adapter loaded. If no instance has it loaded, it either loads it on-demand (cold start: ~5 seconds for a LoRA adapter) or falls back to a replica that loads it. This is the “LoRA adapter switching” pattern.
The interesting engineering problem in Mode 2: how do you decide which adapters to keep hot on which replicas? It’s a caching problem. If tenant X sends 10 requests/minute and tenant Y sends 1 request/hour, tenant X’s adapter should stay loaded; tenant Y’s can be evicted and reloaded on demand. The eviction policy is LRU by adapter, weighted by recent request rate.
The admission layer tracks which tenant’s adapter is hot on which replica and routes accordingly. A request for tenant X goes to a replica with X hot; if none, it goes to any replica and triggers a load, adding the adapter to the hot set.
INTERVIEWER: Good. What monitoring and alerting does this system need?
CANDIDATE: Four layers.
First, job health: training loss curve, gradient norm, step throughput (tokens/sec), GPU utilization. Alert on any job where loss hasn’t decreased in 100 steps or gradient norm exceeds 50 — a softer threshold than the hard-kill limit of 100, so the alert fires before the watchdog acts.
Second, scheduler health: queue depth per tier, average wait time per tier, GPU utilization across the fleet. Alert if any tenant’s job has been queued for more than 2× their expected wait time.
Third, data pipeline health: ingestion lag (time from upload to job start), validation error rate by tenant. Alert if a tenant’s data is stuck in validation for more than 30 minutes.
Fourth, serving health: TTFT, TPOT, adapter load latency, adapter eviction rate. Alert if adapter cold start latency is causing more than 5% of requests to see TTFT > 10 seconds.
The one thing I’d add that’s specific to fine-tuning: a post-training eval gate. Before a fine-tuned adapter gets deployed, run a quick eval — maybe 200 examples from a golden set — to confirm the fine-tune didn’t regress on general capability while improving on the customer’s task. If it did (which can happen with aggressive learning rates or domain-specific data), surface the result to the customer before they deploy. They can choose to proceed or not.
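The eval gate reduces to a simple comparison over the golden set. A minimal sketch, assuming a scalar score per example and an illustrative regression threshold — the function name and 2-point cutoff are inventions for this example:

```python
def eval_gate(base_scores: list[float], tuned_scores: list[float],
              max_regression: float = 0.02) -> dict:
    """Compare fine-tuned vs base model on a golden set (~200 examples).

    Returns a verdict surfaced to the customer before deployment; the
    customer decides whether to proceed, the gate does not auto-block.
    """
    base_avg = sum(base_scores) / len(base_scores)
    tuned_avg = sum(tuned_scores) / len(tuned_scores)
    return {
        "base_avg": base_avg,
        "tuned_avg": tuned_avg,
        "regressed": (base_avg - tuned_avg) > max_regression,
    }
```

Note the gate only measures regression on *general* capability; improvement on the customer’s task is scored separately, on their own eval set.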
INTERVIEWER: That’s a good point. Last question: what’s different about this design if the base model is 70B instead of 7B?
CANDIDATE: Several things change.
First, the minimum hardware footprint for a full fine-tune goes from 1 H100 to a full 8×H100 node and beyond — 70B FP16 weights alone are 140 GB, and the optimizer states and activations on top push the job past a single node. So full fine-tune on 70B requires model parallelism — FSDP with ZeRO Stage 3, or pipeline parallelism across nodes. The scheduler has to allocate multi-node jobs, which adds topological awareness: I want a job’s nodes in the same rack, or at least the same availability zone, to keep all-reduce latency off cross-AZ networking.
For LoRA on 70B, the adapter is still small (a few hundred MB), but loading the base model for inference requires 2× H100 at FP8. The Mode 2 shared pool becomes more expensive — each replica in the shared pool ties up 2 H100s.
The economics of Mode 2 shift: at 7B, a shared pool of 20 H100s holds 20 model copies (one per H100). At 70B, those same 20 H100s hold only 5 copies at FP16 (4 H100s each), or 10 at FP8 (2 each). So you can hot-serve roughly 5-10 tenants simultaneously, and adapter eviction becomes much more aggressive. Tenants with < 10 requests/day would basically always incur cold starts. That’s a product conversation about pricing tiers, not just an engineering decision.
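The capacity arithmetic behind this answer fits in two one-line functions. A back-of-envelope sketch only — the per-copy GPU counts are the rounded figures from the answer, not measured numbers:

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Raw weight memory in GB, ignoring KV cache and activations."""
    return params_billion * bytes_per_param

def hot_copies(fleet_h100s: int, h100s_per_copy: int) -> int:
    """Base-model copies a shared pool can keep resident simultaneously."""
    return fleet_h100s // h100s_per_copy

print(weights_gb(70, 2))   # 140.0 GB at FP16 -> doesn't fit one 80 GB H100
print(hot_copies(20, 1))   # 7B:       20 copies in a 20-H100 pool
print(hot_copies(20, 4))   # 70B FP16:  5 copies
print(hot_copies(20, 2))   # 70B FP8:  10 copies
```

Running the numbers like this out loud is exactly what turns “it gets more expensive” into a senior-level answer.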
INTERVIEWER: Good. That was a thorough answer. I think we’re done.
Interviewer notes — what this candidate did well:
- The IAM role assumption for data isolation was exactly right — enforcement at the IAM layer, not the application layer, is what a senior candidate says.
- Fair-share scheduling with preemption, and the requirement for frequent checkpoints to support preemption — this shows real understanding of GPU scheduling systems.
- The stuck-job watchdog (loss plateau detection, gradient norm explosion) is a concrete production pattern. Most candidates just say “we’d set a timeout.”
- The append-only checkpoint store and the rollback-as-redeployment framing were clean.
- The LoRA adapter LRU cache in the shared pool was well-reasoned.
- The 70B question was handled well — correctly identified the multi-node job topology problem and the economic shift.
What was missing or under-developed:
- No explicit discussion of customer-visible billing or quota usage APIs. Enterprise customers want to know how much they’ve spent. This is an obvious product requirement.
- The data validation section mentioned PII scanning but didn’t describe what happens when PII is found — is the job rejected? Is the PII redacted? Does the customer get to decide?
- The distributed training failure handling was light — what happens when one node in an 8-node FSDP job crashes mid-training? This is a real problem and a good drill target.
Verdict: Strong IC5, borderline IC6. The 70B scaling question was handled well, which pushes toward IC6. The billing and distributed failure gaps would need to be addressed in the conversation to confirm IC6.
130.6 Reading your own transcript
The mock transcripts in this chapter are written down. Real interviews are not. You finish a 45-minute session, leave the building (or close the Zoom call), and within 10 minutes the details are already fading. The candidates who improve fastest are the ones who reconstruct and grade their own transcript before the details disappear.
Here is the process. As soon as the interview ends, find 15 minutes. Write down (roughly, from memory):
- What question did the interviewer ask, and what clarifying questions did you ask?
- What capacity estimate did you give, and was it reasonable?
- What was your architecture — the three-sentence version?
- What did the interviewer push back on, and what did you say?
- Did you cover ops at the end?
Then grade yourself on five questions.
Did I clarify? Did I ask four to six targeted questions before drawing anything? Did I write the answers down? If you drew a diagram in the first five minutes without asking a single question, that’s an automatic loss condition. It doesn’t matter how good the diagram was.
Did I quantify? Did I produce a number for QPS, storage, GPU count, and cost? Did I say my assumptions out loud, or did the numbers appear from thin air? If the interviewer could not have corrected a wrong assumption because you never stated it, that’s a failure. The numbers don’t have to be right; the process of getting to them does.
Did I commit? Senior engineers commit to specific technologies, specific numbers, specific tradeoffs. If every sentence in your transcript contains “it depends” or “we could either do A or B,” you were not committing. The interviewer heard uncertainty, not flexibility. Commit to one option and explain when you would switch to the other.
Did I drill deep? Pick the 10-minute drill section of your transcript and evaluate it by itself. Did you get down to the level of specific failure modes, specific metrics, specific operational behaviors? Or did you name concepts without explaining them? A correct name without a mechanism is mid-level depth. A mechanism with failure modes and metrics is senior depth.
Did I cover ops? Did the last five minutes of the conversation include deployment, monitoring, and failure modes — or did the interview end at the architecture diagram? Ops is not optional. If you ran out of time before reaching it, that means the earlier phases sprawled, which is a time-management failure.
The candidates who improve fastest are not the ones who practice more mock interviews. They are the ones who debrief each mock with the same discipline as a production postmortem. Find what failed. Find why. Change one specific behavior. Run it again.
Read it yourself
- Glassdoor and Blind ML systems interview threads for the specific companies in §130.1 — the interview format shifts quarterly and the internet is a better real-time tracker than any book.
- The Anthropic and OpenAI model cards — reading what these companies publish about their eval methodology tells you what the interviewers care about.
- The Databricks blog on Delta Lake and MLflow — the Databricks interview vocabulary is built from these internal systems.
- Stripe’s engineering blog, especially posts on payment systems reliability — the idempotency vocabulary in §130.1 comes directly from how they write about their own systems.
- Meta’s infrastructure engineering blog, especially posts on multi-tenant recommendation serving and the MTIA chip — for the Meta-specific design vocabulary.
- The AWS re:Invent session recordings on SageMaker architecture — the service design of SageMaker is effectively what the interviewer is asking you to reason about.
Practice
- Pick two companies from §130.1 that you’re actually targeting. For each, write a one-paragraph “pre-interview brief” — what you will say in the first five minutes to signal you’ve adapted the answer to their context.
- Read Mock Transcript 1 again and find the three moments where the candidate could have gone deeper. For each, write the sentence they should have added.
- For Mock Transcript 2, draw the query-path architecture from memory without looking at the diagram. Label every box with a technology choice.
- Mock Transcript 3 ends with a question about 70B. Write an extended answer (two additional paragraphs) covering the distributed training failure scenario the interviewer notes section flagged as missing.
- Do a self-recorded mock on “Design a multi-tenant LLM fine-tuning service.” Stop after 45 minutes. Grade yourself on the five questions in §130.6. Write down the single most specific thing you’d change.
- Stretch: Find a partner and trade mocks: you interview them on RAG, they interview you on fine-tuning. Grade each other using the interviewer notes format from this chapter — score what was strong, what was missing, and what would have swung the IC5-vs-IC6 call.