Part X · ML System Design Interview Playbook
Chapter 116 ~19 min read

Design a chatbot for 1M users

"The answer to 'design a chatbot' is never 'use an LLM.' It's 'here is the fleet, here is the budget, here is the failure mode, and here is the on-call runbook"

This is the first worked example of the Part X design interview series. “Design a chatbot for 1 million users” is the canonical opener — every big tech ML systems loop has some version of it. The point of working it end to end is not the specific answer (which changes with each question’s constraints); the point is seeing what it looks like when a senior candidate executes the framework from Chapter 114 with the numerical kit from Chapter 115.

This chapter is denser than a typical chapter because every phase of the interview is worked explicitly. Expect it to read like a transcript of a whiteboard session, with the candidate’s internal monologue visible. The design choices made are illustrative, not canonical — for every one, an alternative is named.

Outline:

  1. Clarify: six questions and their answers.
  2. Estimate: users to GPUs to dollars, in 90 seconds.
  3. High-level architecture: one diagram, every box labeled.
  4. Drill 1: the LLM serving fleet.
  5. Drill 2: caching, admission, and safety.
  6. Autoscaling and cold starts.
  7. Observability and SLOs.
  8. Failure modes and the on-call experience.
  9. Rollout and evaluation.
  10. Tradeoffs volunteered along the way.
  11. The mental model.

116.1 Clarify: six questions and their answers

The candidate opens with the framework plan (§114.2), then runs the clarify phase. Six questions, for the canonical version of this problem:

1. Is “1M users” MAU, DAU, or total registered? The interviewer says 1M DAU. That’s the number that matters for capacity.

2. What’s the expected session volume? Assume 5 conversational turns per session, 3 sessions per user per day on average. Peak is ~4× daytime average.

3. What’s the latency target? Time to first token (TTFT) under 500 ms p95, time per output token (TPOT) under 50 ms p95. These are the numbers that feel snappy for a chat experience. Total response for a 300-token completion is ~500 ms + 300 × 50 ms = 15.5 seconds worst-case, roughly 8 seconds typical.

4. What’s the quality target? “As good as GPT-4o-mini” for most queries, with an escalation path to a stronger model for ~5% of traffic. The candidate writes this down because it shapes the model-routing layer.

5. What are the safety requirements? Must block CSAM, self-harm, and prompt injection attempts. Hate speech and PII filtering are required. Appeals path needed for false positives. No human review in-line for latency reasons, only async for flagged traffic.

6. Is this multi-region? Yes, three regions (US, EU, APAC) for latency and data residency. EU data must stay in EU.

The candidate writes each answer next to its question on the board. These are the six constraints the design has to satisfy.

116.2 Estimate: users to GPUs to dollars, in 90 seconds

Walking Chain 1 from Chapter 115 out loud:

  • DAU: 1,000,000.
  • Sessions per day: 1M × 3 = 3M.
  • Turns per day: 3M × 5 = 15M turns per day.
  • Tokens per turn: Assume ~800 prompt (including recent conversation context) + ~400 output = 1200 total. Output is what costs us compute, so focus on 400.
  • Output tokens per day: 15M × 400 = 6 billion output tokens per day.
  • Input tokens per day: 15M × 800 = 12 billion input tokens per day.
  • Average output tokens per second: 6B / 86,400 ≈ 70,000 tokens/sec.
  • Peak output tokens per second: ×4 = 280,000 tokens/sec.
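The chain above fits in a few lines; every figure is the chapter's stated assumption, not a measurement:

```python
# Back-of-envelope token volume for 1M DAU, per the assumptions above.
dau = 1_000_000
sessions_per_user = 3
turns_per_session = 5
out_tokens_per_turn = 400
in_tokens_per_turn = 800
peak_multiplier = 4

turns_per_day = dau * sessions_per_user * turns_per_session     # 15M
out_tokens_per_day = turns_per_day * out_tokens_per_turn        # 6B
in_tokens_per_day = turns_per_day * in_tokens_per_turn          # 12B
avg_out_tps = out_tokens_per_day / 86_400                       # ~69.4k/sec
peak_out_tps = avg_out_tps * peak_multiplier                    # ~278k/sec
```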

Now the GPU math. The candidate assumes a 70B-class model at FP8 on H100, achieving ~1500 tokens/sec realistic decode throughput per GPU (Chapter 115, §115.5). For the 5% escalation to a stronger model, assume a similar 70B model (the escalation is for harder prompts, not a bigger model — a common pragmatic choice).

  • GPUs at peak for decode: 280,000 / 1500 ≈ 190 GPUs.
  • Add prefill capacity: prefill is separate compute. ~12B input tokens/day averages ~140k tokens/sec. At ~2500 prefill tokens/sec per GPU (Chapter 30, §30.3), that’s ~55 GPUs of standalone prefill. Prefill and decode share GPUs in vLLM’s chunked-prefill mode, so only a fraction of that load lands on the decode fleet: 190 + 55/4 ≈ 204, call it ~210 GPUs blended.
  • Headroom: ×1.3 for rolling deploys, spikes, and regional variance → 275 GPUs.
  • Spread across three regions: ~100 per region.
  • Cost: 275 × $2/hour × 720 hours/month = $396k/month for inference compute.

In dollars per million tokens: $396k/month ÷ (6B output tokens/day × 30 days) × 1M ≈ $2.20 per million output tokens. That’s higher than the $0.50 spot number from Chapter 30 because this includes peak sizing, regional spread, and headroom — realistic production overhead.
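Continuing the chain from tokens to GPUs to dollars, again with the chapter's assumed throughput and price figures:

```python
# Peak tokens/sec -> GPU count -> $/month -> $/Mtok.
peak_out_tps = 280_000
decode_tps_per_gpu = 1_500                 # 70B FP8 on H100, realistic decode
decode_gpus = peak_out_tps / decode_tps_per_gpu      # ~187, call it 190
blended_gpus = 210                         # decode + amortized prefill share
fleet = 275                                # x1.3 headroom, rounded
gpu_hour_usd = 2
monthly_usd = fleet * gpu_hour_usd * 720   # $396k/month
out_tokens_per_month = 6_000_000_000 * 30
usd_per_mtok = monthly_usd / out_tokens_per_month * 1_000_000   # ~$2.20
```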

The candidate announces: “I’m going to target a 275-H100 fleet at about $400k/month. If the budget is tighter, the levers are FP8 to INT4 (~1.7× throughput), a smaller model (8B handles ~60% of traffic based on typical routing), and aggressive prefix caching. If the budget is looser, I’d go to 8× H200 per region for lower tail latency.”

116.3 High-level architecture

The diagram the candidate draws on the board:

   [ mobile / web client ]
         |
         v
   [ CDN / edge (Cloudflare or CloudFront) ]
         |
         v
   [ API gateway (Envoy AI Gateway) ]
      - OpenAI-compatible front door
      - JWT auth, rate limit (token bucket)
      - request validation, OpenTelemetry entry span
         |
         v
   [ admission / router ]
      - per-tenant quota check
      - model selection (small vs large)
      - region pinning (EU -> EU fleet)
         |
         v
   [ safety pre-filter ]
      - regex blocklists (fast path)
      - small classifier (Llama Guard class, <30 ms)
      - if flagged: reject with appeal link
         |
         v
   [ LLM serving fleet ] <---> [ shared prefix cache (LMCache + Redis) ]
      - vLLM on KServe, TP=2 on H100
      - continuous batching, chunked prefill
      - prefix caching (system prompt is ~2k tokens)
      - streaming via SSE
         |
         v
   [ safety post-filter ]
      - streaming-aware: flag on finalized chunks
      - async human review for edge cases
         |
         v
   [ metering sidecar ]
      - emit billing event per turn
      - partitioned Kafka topic
         |
         v
   [ telemetry pipeline ]
      - Prometheus metrics
      - OpenTelemetry traces -> Tempo
      - Loki for logs

The LLM serving fleet is the money-burning center of the design — every other box exists to feed it cleanly or report what it produced.

A session state store (Redis) holds per-user conversation history with TTL. A vector store may sit alongside the router for retrieval-augmented lookups against user preferences or memory, but the candidate explicitly flags “I’m not going to treat this as a RAG system — that’s Chapter 117. Here, the retrieval is lightweight per-user memory, not corpus retrieval.”

The candidate names every technology choice and, for each, volunteers the alternative they considered. Envoy AI Gateway was chosen over a custom gateway because it gives OpenAI compatibility for free. vLLM was chosen over SGLang because the organization has more operational experience with it, though SGLang’s RadixAttention would help prefix caching. KServe was chosen over raw Kubernetes Deployments because canary rollouts and traffic splitting come for free.

The whole diagram takes eight minutes. The interviewer nods at “LLM serving fleet” and says: “let’s drill into that.”

116.4 Drill 1: the LLM serving fleet

The LLM serving fleet is where the money lives and where the candidate earns or loses the interview.

The job. Accept an OpenAI-compatible chat/completions request, produce streaming tokens within TTFT 500 ms and TPOT 50 ms p95, emit token-level telemetry, return metering events on completion.

[Figure: LLM request lifecycle — queue delay plus prefill dominates TTFT; decode step time determines TPOT. Queue ~20 ms → prefill (800 tok) ~270 ms → TTFT ≈ 290 ms; then tok₁ … tok₃₀₀ at 40 ms each (TPOT = 40 ms/token) → EOS + meter event emitted.]
TTFT is dominated by queue delay and prefill time; reducing TTFT means smaller batches or faster prefill, which costs throughput — the fundamental latency/throughput tradeoff.

Request lifecycle. A request arrives at a vLLM replica. It carries a tokenized prompt (~800 tokens typical). vLLM’s scheduler admits it into the current batch at the next scheduling step — typically within 5–20 ms. Prefill runs: for 800 tokens on Llama 70B FP8, that’s ~800 / 3000 ≈ 270 ms of prefill time (peak prefill at chunked prefill batch size 4096 tokens). TTFT is therefore dominated by queue delay + prefill time — roughly 280–400 ms under normal load. Decode begins. Each decode step produces one token across all in-flight requests in the batch; at batch size 64, that’s ~40 ms per step, giving TPOT ≈ 40 ms — under the 50 ms budget. The streamed tokens flow back through SSE to the client.
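The lifecycle's latency budget checks out arithmetically (figures are the chapter's assumptions):

```python
# TTFT/TPOT budget for the request lifecycle above.
queue_ms = 20
prompt_tokens = 800
prefill_tokens_per_sec = 3_000      # chunked-prefill assumption, 70B FP8
prefill_ms = prompt_tokens / prefill_tokens_per_sec * 1_000   # ~267 ms
ttft_ms = queue_ms + prefill_ms                               # ~287 ms
tpot_ms = 40                        # per decode step at batch size 64
completion_tokens = 300
e2e_ms = ttft_ms + completion_tokens * tpot_ms                # ~12.3 s total
```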

Dominant constraints. Two. First, KV cache memory (Chapter 22): Llama 70B FP8 has ~160 KB per token of KV cache. At 8k context × 64 concurrent users, that’s 8192 × 64 ≈ 512K tokens; 512K tokens × 160 KB ≈ 80 GB of KV cache per replica. On a 2×H100 replica (160 GB total), minus 70 GB of model weights, that leaves 90 GB — enough. But bump context to 32k and it explodes. Second, batch composition: mixing long and short contexts in the same batch leads to uneven compute per step, wasting GPU.
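The ~160 KB/token figure and the replica headroom follow from the model shape (assuming a Llama-70B-class architecture: 80 layers, 8 KV heads via GQA, head dim 128, FP8 at 1 byte/element):

```python
# KV cache sizing behind the ~160 KB/token figure.
layers, kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 1
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K + V

context_tokens, concurrent_users = 8_192, 64
kv_gb = context_tokens * concurrent_users * kv_bytes_per_token / 2**30  # 80 GB

replica_hbm_gb = 2 * 80        # TP=2 on H100
weights_gb = 70                # 70B params at FP8
free_gb = replica_hbm_gb - weights_gb   # 90 GB: the 80 GB of KV fits
```

Rerun the same arithmetic at 32k context and the KV demand quadruples to ~320 GB, which is why long-context traffic gets its own fleet.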

Optimizations that apply.

  • Continuous batching (Chapter 23). Essential; non-negotiable.
  • Prefix caching (Chapter 29). The system prompt is ~2k tokens shared across every request. With prefix caching on, those 2k tokens are computed and stored once per replica, not once per request: at ~160 KB per token, that’s ~320 MB of KV cache held once instead of duplicated for every in-flight request. More importantly, it cuts prefill time for system-prompted requests by ~40%.
  • Chunked prefill. vLLM’s chunked prefill lets long prompts interleave with ongoing decode steps, smoothing tail latency when a new request arrives with a big prompt.
  • Speculative decoding (Chapter 27). A small 1B-parameter draft model doubles decode throughput for predictable text. Costs VRAM for the draft model; worth it if the workload is text-heavy (it is).
  • FP8 quantization (Chapter 26). ~1.7× throughput over bf16, minimal quality loss on Llama 70B. Already assumed.
  • Tensor parallelism TP=2 for the 70B model. PP is not worth the complexity at this scale; TP=2 fits the model and keeps per-token latency low.
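A fleet replica's launch, as a sketch: the model name is illustrative, and flag names follow vLLM's CLI at the time of writing (recent versions enable prefix caching by default; verify against your version's docs):

```shell
# One replica of the serving fleet: 70B at FP8, TP=2, chunked prefill,
# prefix caching, sized for ~64 concurrent sequences.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.92
```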

Failure modes.

  • KV cache full. vLLM drops low-priority requests and re-schedules them, causing spikes in p99 TTFT. Mitigation: watch vllm:gpu_cache_usage_perc and scale out at 85%, not 95%.
  • One slow request poisons the batch. A request with 32k context slows every other request in its batch. Mitigation: route long-context traffic to a dedicated fleet with higher-memory GPUs (H200).
  • Hot spot on a shared prefix. If 10k users all hit one popular system prompt, the prefix cache is great, but if that prompt is updated mid-flight, the cache is invalidated. Mitigation: version prompts and key the cache by prompt version, so old sessions drain on the old version while new sessions warm the new one.
  • OOM on cold start. Model weights take 60 seconds to load from NVMe; if the --gpu-memory-utilization is wrong, the first request after warmup can OOM. Mitigation: explicit warmup phase with synthetic traffic before marking the pod Ready.

Metrics and alerts.

  • vllm:num_requests_running, num_requests_waiting — autoscale trigger.
  • vllm:gpu_cache_usage_perc — backstop autoscale trigger.
  • TTFT p50, p95, p99 — alert if p95 > 600 ms for 5 minutes.
  • TPOT p50, p95, p99 — alert if p95 > 60 ms.
  • vllm:time_to_first_token_seconds histogram — for distribution analysis.
  • vllm:e2e_request_latency_seconds — user-perceived total.
  • Tokens/sec by model — for cost-per-token tracking.
  • vllm:num_preemptions_total — preemption rate; spike signals KV pressure.

The candidate pauses. The interviewer asks “what about cost reduction?” Prompt for drill 2.

116.5 Drill 2: caching, admission, and safety

Prefix caching. The largest lever. Every user’s conversation shares a ~2k-token system prompt. Turning prefix caching on in vLLM is a one-line flag (enable_prefix_caching=True). It cuts prefill compute by ~60% for system-prompted traffic, which is most traffic. The saving is roughly 30% of total GPU cost. The cost: a small increase in memory footprint (the shared prefix blocks stay resident).

Cross-replica prefix caching with LMCache (Chapter 53). When a user’s conversation moves from replica A to replica B (e.g., after a rollout), the prefix cache on A is cold for B. LMCache shares prefix blocks across replicas via Redis, so a replica that hasn’t seen the prompt yet can fetch the KV cache for the shared prefix instead of recomputing. Per-request latency wins 50–200 ms on cache hits. Adds a Redis hop (~500 μs) on misses.

Prompt/response caching. For identical prompts (rare in chat, common in some apps), cache the full response keyed by hash of prompt + model + temperature. Redis, TTL of hours. Hit rate for a chat workload is low (~5%); for a Q&A product it can be 30%+. Worth implementing but not a dominant lever here.
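A minimal sketch of the exact-match cache key, assuming the policy of caching only deterministic (temperature-0) generations is enforced by the caller:

```python
import hashlib
import json

def response_cache_key(prompt: str, model: str, temperature: float) -> str:
    """Exact-match cache key: hash everything that changes the output."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "temperature": temperature},
        sort_keys=True,
    )
    return "resp:" + hashlib.sha256(payload.encode()).hexdigest()
```

The Redis value would be the full completion with an hours-scale TTL; any change to model or sampling knobs misses by construction.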

Admission control. Per-tenant token-bucket rate limiting (Chapter 76). Free tier: 20 requests/minute. Paid tier: 300 requests/minute. Premium: uncapped. The admission layer rejects over-quota requests with a 429 before they touch the GPU. Without this, a single runaway client can starve the rest of the fleet.

Load shedding. When the fleet is past a watermark (e.g., 90% of KV cache used, or p95 TTFT > 800 ms), the admission layer starts returning 503s for free-tier traffic while still admitting paid-tier. This is “back-pressure not failure” (Chapter 122) — the front door absorbs load before the expensive GPU layer does.
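The admission decision — quota first, shed under pressure — can be sketched as below. Tier names and watermarks are the chapter's; the implementation details are illustrative:

```python
import time

class TokenBucket:
    """Per-tenant token bucket: rate_per_min requests/min, bursts to capacity."""
    def __init__(self, rate_per_min: float, capacity: float):
        self.rate = rate_per_min / 60.0
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def admit(tier: str, bucket: TokenBucket, kv_usage: float, ttft_p95_ms: float) -> int:
    """HTTP status for the request: 200 admit, 429 over quota, 503 shed."""
    overloaded = kv_usage > 0.90 or ttft_p95_ms > 800
    if overloaded and tier == "free":
        return 503            # back-pressure: shed free tier before the GPUs hurt
    if not bucket.allow():
        return 429            # over per-tenant quota
    return 200
```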

Safety pre-filter. Two stages. A regex and rules stage runs in the gateway: <5 ms, catches obvious blocklisted terms and prompt-injection patterns. A small classifier stage runs as a separate TEI-hosted service (Chapter 49): Llama Guard–class model, ~30 ms, catches nuanced policy violations. If flagged, the request is rejected with a tagged error that carries an appeal ID.

Safety post-filter. The streamed output is scanned chunk-by-chunk for policy violations. If a violation appears mid-stream, the stream is truncated and a policy notice is appended. This is delicate: truncating a stream cleanly without breaking the client SSE connection requires the gateway to be aware of the content policy. Appeals (flagged false positives) are sent to a human review queue async.

The candidate spends ~5 minutes on this drill and moves on.

116.6 Autoscaling and cold starts

Autoscaling uses KEDA (Chapter 51) on vllm:num_requests_running with a target of 40 requests per replica. Backstop trigger on vllm:gpu_cache_usage_perc at 85%. Minimum replicas per region: 10 (to keep cold-start pain off the critical path). Maximum: 150 per region (the peak sizing with headroom).
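The two triggers map onto a KEDA ScaledObject roughly as follows (a sketch: resource names and the Prometheus address are illustrative; the queries assume vLLM's standard metric names):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: chat-llm-fleet
spec:
  scaleTargetRef:
    name: vllm-chat-70b          # the Deployment behind the predictor
  minReplicaCount: 10            # keep cold starts off the critical path
  maxReplicaCount: 150
  triggers:
    - type: prometheus           # primary: in-flight requests per replica
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: avg(vllm:num_requests_running)
        threshold: "40"
    - type: prometheus           # backstop: KV cache pressure
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: avg(vllm:gpu_cache_usage_perc)
        threshold: "0.85"
```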

Cold starts are the painful part. A pod that has to pull Llama 70B FP8 (~35 GB) from a registry across AZs takes 60–90 seconds. Weights-on-node-local-NVMe pattern (Chapter 52) brings this to 15–25 seconds. A synthetic warmup run after weight load — one prefill + ten decode steps — adds another ~5 seconds but ensures the CUDA kernels and batch machinery are hot. Only then does the readiness probe flip green.

Scale-up triggers: if num_requests_running average over 5 minutes > 40 per replica, add one replica per minute, up to the max. Scale-down triggers: if average < 20 per replica for 10 minutes, remove one replica. Scale-down is slow on purpose — premature scale-down under variable load causes thrashing, and GPUs are expensive to restart.

The candidate volunteers: “I’d also pre-scale on a schedule. Daytime US peak is predictable; I’d ramp the US region to 80% of its peak capacity by 8 a.m. Pacific automatically. KEDA cron triggers do this cleanly.”

116.7 Observability and SLOs

SLIs and SLOs for the chatbot:

  • TTFT SLI: p95 TTFT over 5-minute window. SLO: < 600 ms 99% of the time.
  • TPOT SLI: p95 TPOT over 5-minute window. SLO: < 60 ms 99% of the time.
  • Availability SLI: successful responses (non-5xx) over total requests. SLO: 99.9% over a 28-day window.
  • Quality SLI: proxy based on user feedback (thumbs-up/down) and periodic LLM-judge evals. SLO: judge score ≥ 0.85 on a golden set.
  • Safety SLI: false positive rate on moderation. SLO: < 2% of benign traffic flagged.

Error budget for availability is 0.1% × 28 days × 86,400 seconds ≈ 2400 seconds. If we burn 1200 in an hour, we stop shipping risky changes and start reviewing deploys (see Chapter 97).
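The budget arithmetic, spelled out:

```python
# 28-day availability error budget at a 99.9% SLO.
slo = 0.999
window_s = 28 * 86_400
budget_s = (1 - slo) * window_s    # ~2419 s of allowed bad seconds
half_budget_s = budget_s / 2       # ~1200 s: the stop-ship line above
```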

Observability stack:

  • Metrics: Prometheus scraping vLLM at 15-second intervals. KEDA reads from Prometheus via the prometheus-adapter scaler.
  • Logs: structured JSON logs shipped by Vector to Loki. One log line per request at info level; trace-level logs are sampled at 0.1%.
  • Traces: OpenTelemetry traces with W3C trace-context propagation from gateway through admission → safety → serving → metering. Tail-based sampling at 1% of normal traffic, 100% of errors.
  • Dashboards: one top-level dashboard per region with the four golden signals per stage. A “serving health” dashboard per replica pool with KV cache utilization, batch size, and preemption count.

116.8 Failure modes and the on-call experience

The top five production failure modes the candidate would write a runbook for:

1. TTFT regression after deploy. Symptom: p95 TTFT doubles after a new model version ships. Likely cause: FP8 calibration drift, or prefix cache invalidation. Runbook: roll back, re-run the calibration, roll forward. SLO burn rate alert fires before users notice.

2. KV cache exhaustion. Symptom: num_preemptions_total spikes, p99 TTFT grows. Cause: traffic mix shift (more long-context users). Runbook: bump scale-up aggressiveness, consider routing long-context to H200 fleet. Manual scale-out if automatic is too slow.

3. Safety service down. Symptom: admission layer rejecting 100% of traffic. Cause: TEI pod crash-looping or out-of-memory. Runbook: fail-open on admission for a short window (5 minutes), with heightened logging, while the safety fleet is repaired. Fail-open vs fail-closed is a policy decision — for a chatbot, fail-open with audit trail is usually right; for CSAM detection, fail-closed.

4. Regional outage. Symptom: US region unreachable. Cause: AZ failure, network partition. Runbook: DNS failover to nearest region, accept the latency hit (100–200 ms extra). EU data must still stay in EU, so EU users get degraded latency to EU-west if EU-central is down.

5. Bad model shipped. Symptom: quality golden-set evals drop after rollout. Cause: fine-tune regression, bad prompt update, tokenizer mismatch. Runbook: shadow traffic eval catches this before users; if not, roll back via KServe traffic split within 5 minutes.

On-call experience: primary gets paged on any SLO fast-burn (~14× normal rate) or any availability drop below 99.5%. Secondary backs up. The runbook is a Notion page with one heading per alert, keyed to a unique alert ID. Most pages are self-resolving (autoscaling); the rest require 5–30 minutes of human attention.

116.9 Rollout and evaluation

Rollouts are KServe canary splits. A new model version is deployed alongside the existing one, with traffic shifted progressively: 1% → 5% → 25% → 100%, with 15-minute bake periods at each step and automatic rollback if any SLO breaches. The shadow eval pattern (Chapter 98) runs the golden set against the new version before any traffic shift; if the golden-set score drops more than 2%, the shift doesn’t start.
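The traffic split lives on the KServe InferenceService; a sketch (model name, format, and storage URI are illustrative; `canaryTrafficPercent` is the KServe field that drives the split, stepped up at each bake period):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: chat-70b
spec:
  predictor:
    canaryTrafficPercent: 5            # 1 -> 5 -> 25 -> 100, per bake step
    model:
      modelFormat:
        name: huggingface              # runtime choice is illustrative
      storageUri: s3://models/chat-70b/v2   # candidate version
```

Setting `canaryTrafficPercent` back to 0 (or re-pointing `storageUri` at the prior version) is the 5-minute rollback path named in §116.8.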

Model evaluation has three layers:

  • Offline golden set: 500 carefully curated prompts with expected-quality scoring. Runs on every candidate model before deploy.
  • LLM-as-judge: 5000 production-like prompts scored by a larger model. Runs nightly.
  • Online A/B: user thumbs-up/down rate compared between traffic splits. Requires ~24 hours of data per split to be significant.

The candidate closes: “I’d want the golden set to grow over time from real production failures. Any incident where a user reported a bad response gets added to the set, so the next model release can’t regress on it. That’s the compounding advantage of a good eval discipline.”

116.10 Tradeoffs volunteered along the way

A count of the explicit tradeoffs the candidate made in the hour:

  1. Envoy AI Gateway vs custom: Envoy for OpenAI compatibility; would build custom if routing logic became too complex.
  2. vLLM vs SGLang: vLLM for operational maturity; SGLang would win on prefix-heavy workloads via RadixAttention.
  3. 70B FP8 vs 70B bf16: FP8 for throughput; bf16 if quality regressions showed up on the golden set.
  4. TP=2 vs TP=4: TP=2 for throughput per dollar; TP=4 if per-request latency needed to drop.
  5. KEDA vs Knative: KEDA for fine-grained metric triggers; Knative would be simpler for scale-to-zero (not needed here since min > 0).
  6. Inline safety vs async: inline for the first line of defense; async for nuanced appeals.
  7. LMCache vs per-replica prefix caching: per-replica is simpler; LMCache is worth the Redis hop if cross-replica hit rate is >20%.
  8. Fail-open vs fail-closed on safety: fail-open with audit trail for user-facing, fail-closed for CSAM-class content.

Eight explicit tradeoffs in 45 minutes is a strong performance. More would look rehearsed; fewer would look shallow.

116.11 The mental model

Eight points to take into Chapter 117:

  1. Clarify first, always. Six questions before any math or diagrams.
  2. The estimate chain is 90 seconds from users to GPUs to dollars. Announce every assumption.
  3. The reference architecture has eight boxes. Gateway, admission, safety pre, LLM fleet, safety post, metering, telemetry, session store. Know it by heart.
  4. KV cache and batch composition are the dominant constraints on LLM serving. Every serving optimization maps to one or both.
  5. Prefix caching is the single biggest cost lever for chat workloads with shared system prompts. Implement it before anything else.
  6. Autoscale on num_requests_running, not CPU. Min > 0 to hide cold starts. Pre-scale on schedule where traffic is predictable.
  7. SLOs are user-facing, not infrastructure. TTFT, TPOT, availability, quality, safety false-positive rate. Five SLOs for a chatbot.
  8. Rollouts are canary + shadow eval, not big-bang. Golden set gates every deploy. Incidents feed the golden set.

Chapter 117 takes the same framework and applies it to a retrieval-heavy variant: a RAG system over 10 TB of documents.


Read it yourself

  • The vLLM documentation, especially the section on production deployment and autoscaling.
  • Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention — for the KV cache math that drives serving capacity.
  • The Envoy AI Gateway documentation — for the OpenAI-compatible routing pattern.
  • The KServe docs on InferenceService and traffic splitting.
  • Google SRE book, chapter 5 (“Eliminating Toil”) and chapter 11 (“Being On-Call”) — for the operational vocabulary used in §116.8.
  • Alex Xu, System Design Interview — An Insider’s Guide, vol. 2, chapter on chat systems — the non-ML version of this design.

Practice

  1. Re-do the estimation chain in §116.2 for 10M DAU instead of 1M. What’s the GPU count and monthly cost? At what user count does a single region stop fitting in 150 H100s?
  2. Suppose the latency requirement changes to TTFT p99 < 200 ms. What does the fleet architecture have to change? (Hint: batch size, TP, replica count.)
  3. The interviewer says “the budget is $100k/month, not $400k. Redesign.” Walk through the levers in order of cost impact.
  4. You’re told 30% of traffic is RAG-augmented with 6k tokens of context. Redo the KV cache math. Do you need H200s?
  5. Design the rollout plan for a new model version. Include the eval gate, canary schedule, automated rollback triggers, and manual rollback procedure.
  6. Add a feature: per-user conversation memory that spans sessions. Design the storage, retrieval, and freshness model. How does it interact with the LLM serving fleet?
  7. Stretch: The interviewer says “halfway through, we’re changing the problem: make it multimodal, with image and audio inputs.” Rewind the design from §116.3 and rebuild the diagram. Which boxes change? Which stay the same? What new metrics and new failure modes emerge?