Design: the frontier scenarios
"The frontier is where every assumption from the previous chapter breaks — and where the interview separates the candidate who has shipped it from the one who has only read about it"
The three problems in this chapter are the hardest design interviews in the ML systems category. They are “frontier” in the sense that the production patterns are still forming, the trade-offs are sharper, and the failure modes are more surprising. A long-term memory store for a conversational AI requires you to solve privacy, relevance decay, and cross-device sync simultaneously. An online-learning recommender requires you to reason about counterfactual bias, exploration/exploitation, and incremental model updates as a single integrated system. A multimodal chat backend requires you to design five separate encoding pipelines, manage a shared context window budget across modalities, and handle safety at each layer — all within a latency budget that streaming token output demands. These are not beginner problems. They are what you design when you have already shipped a chatbot, a RAG system, and a recommendation pipeline and your organization has run out of easier problems.
§127.1 Design a long-term memory store for a chat assistant
The problem
The interviewer asks: “ChatGPT-style assistants lose all context when the session ends. Design a long-term memory system so the assistant remembers what it learned about a user across sessions, retrieves relevant memory at query time, and handles GDPR deletion requests.”
What they are testing: do you understand the tension between perfect recall (store everything) and useful recall (retrieve what matters now)? Do you understand the privacy implications of storing a compressed representation of what a user said months ago? Do you understand recency vs relevance scoring for retrieval? The interviewer is looking for a candidate who thinks about memory as a database problem, not as a RAG problem with a longer time horizon.
Clarifying questions
- What is a “memory”? A verbatim quote from a past session? A summary of a session? A structured fact extracted from conversation (user is a vegetarian, user has a 7-year-old daughter)? The answer determines storage format and retrieval strategy.
- How much memory per user? Unlimited? Or a cap (e.g., 10 MB or 1,000 memories)? Without a cap, a long-term user accumulates unbounded data, and retrieval quality degrades.
- What is the retrieval trigger? At every user turn, query memory for relevant context? Or only on specific question types?
- Cross-device sync? Same memory across mobile, web, and API? This adds a sync protocol design problem.
- GDPR and deletion? The user must be able to see their memories and delete any of them. The assistant must never reconstruct deleted memories from other signals.
- What does “stale memory” mean? A user’s job has changed since the memory was created. How does the system handle contradictory memories from different time periods?
Back-of-envelope
Memory density: assume a power user has 10,000 memories after 2 years of daily use. Each memory is a short structured fact plus a compact embedding: ~500 bytes of text + 1024-dim fp16 embedding (2 KB) = ~2.5 KB per memory. Per user: 10,000 × 2.5 KB = 25 MB. For 10M users: 250 TB of memory storage. At S3 pricing: ~$5,750/month. Manageable.
Retrieval latency: at query time, we need to find the top-k memories relevant to the current conversation. 10,000 memories per user, HNSW in-memory, single-user index: sub-millisecond. But we cannot maintain a separate HNSW index per user in RAM at 10M users: 10M × 25 MB = 250 TB of RAM. The index must be on disk (DiskANN) or in a shared multi-tenant vector store (Qdrant with per-user namespacing). Qdrant with a per-user collection can serve 10M users from a 500-node cluster (each node holds ~20k user collections in 30 GB RAM for the active set). Query latency: 5–20 ms for a Qdrant point search, well within the pre-generation latency budget.
Summarization cost: at query end, a session is summarized by a small model (8B, ~$0.11/M tokens). An average session generates ~5,000 tokens of conversation. Summarizing to 10 memories of 50 tokens each: 5,000 tokens in + 500 tokens out = 5,500 tokens per session. For 10M DAU with 3 sessions each: 165B tokens/day of summarization. At $0.11/M: $18,150/day = $545k/month. This is expensive. Mitigation: only summarize sessions that exceed a significance threshold (session length > 200 tokens, or user sends a command to “remember this”). Target: 10% of sessions trigger summarization → $55k/month.
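The storage and summarization arithmetic above can be checked in a few lines; every constant is one of the chapter's assumptions, not a measured value:

```python
# Back-of-envelope check for the memory-store numbers (chapter assumptions).

USERS = 10_000_000
MEMORIES_PER_USER = 10_000
BYTES_PER_MEMORY = 2_500          # ~500 B text + 2 KB fp16 embedding

storage_tb = USERS * MEMORIES_PER_USER * BYTES_PER_MEMORY / 1e12  # 250 TB

DAU = 10_000_000
SESSIONS_PER_DAY = 3
TOKENS_PER_SESSION = 5_500        # 5,000 in + 500 out
PRICE_PER_M_TOKENS = 0.11         # 8B summarizer price assumption

daily_cost = DAU * SESSIONS_PER_DAY * TOKENS_PER_SESSION / 1e6 * PRICE_PER_M_TOKENS
# ~$18,150/day unfiltered; with a 10% significance trigger rate:
monthly_cost_filtered = daily_cost * 30 * 0.10  # ~$55k/month
```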
Architecture
graph LR
SESSION["Session\n(chat turns)"] --> SIGFILT["Significance Filter\n(session length · explicit commands)"]
SIGFILT -->|"significant"| EXTRACTOR["Memory Extractor\n8B model: session → memory candidates\nstructured JSON facts"]
EXTRACTOR --> DEDUP["Memory Dedup\ncheck vs existing memories\nembed + cosine similarity"]
DEDUP -->|"new"| MEMSTORE["Memory Store\nQdrant: per-user namespace\nvector + metadata + timestamp"]
DEDUP -->|"contradicts existing"| RESOLVE["Conflict Resolver\nmark old memory as superseded\nrecency wins by default"]
MEMSTORE --> RETRIEVAL["Retrieval Service\nquery-time: top-k by recency×relevance\naugments context window before LLM call"]
RETRIEVAL --> LLM["LLM Serving Fleet\nconversation + retrieved memories\n→ response"]
MEMSTORE --> GDPR["GDPR Export / Delete\nper-user memory list API\nhard delete + vector deletion"]
SESSION --> SYNC["Cross-device Sync\nWebSocket + Redis pub/sub\nversion vector per user"]
style MEMSTORE fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The memory store is the persistent organ; the extractor and retrieval service are its input and output valves — quality of extraction and precision of retrieval determine whether memory feels magical or creepy.
Memory extraction runs asynchronously at session end (not on the critical path). A small 8B model is given the full session transcript and a structured extraction prompt:
From this conversation, extract factual user preferences and profile information
as a list of JSON objects: {"fact": "...", "confidence": 0-1, "type": "preference|biography|context"}.
Only extract durable facts, not conversation-specific observations.
Output is filtered by confidence threshold (> 0.7) and deduplicated against existing memories (cosine similarity > 0.9 with any existing memory → skip; similarity 0.7–0.9 → flag for conflict resolution). Extracted memories are tagged with: user ID, timestamp, session ID (for provenance), source conversation hash (for deletion — if the user deletes the source session, its derived memories are also flagged for deletion).
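The dedup routing can be sketched as follows (thresholds from above; `route_candidate` is a hypothetical helper, and embeddings are assumed L2-normalized so a dot product is cosine similarity):

```python
import numpy as np

def route_candidate(candidate_emb, existing_embs,
                    skip_thresh=0.9, conflict_thresh=0.7):
    """Route an extracted memory candidate: 'skip' (duplicate of an
    existing memory), 'conflict' (similar but not identical -> send to
    the conflict resolver), or 'new' (insert into the store)."""
    if len(existing_embs) == 0:
        return "new"
    sims = existing_embs @ candidate_emb   # cosine sim for unit vectors
    best = float(np.max(sims))
    if best > skip_thresh:
        return "skip"
    if best > conflict_thresh:
        return "conflict"
    return "new"
```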
Memory retrieval augments the context window before each LLM call. At the start of a user’s turn, the retrieval service embeds the current turn (plus the last 3 turns of conversation for context), queries the user’s Qdrant namespace for top-10 memories, and scores them by a combined relevance × recency function:
score = cosine_similarity(query_embedding, memory_embedding)
× recency_decay(days_since_created)
× confidence_score
recency_decay is a sigmoid that applies a 50% penalty after 180 days and a 90% penalty after 2 years. This prevents very old memories from dominating retrieval even when their embedding similarity is high (the user’s job 3 years ago is less relevant than their job last month). The top-3 to top-5 memories by this score are included in the system prompt as a [User profile] section.
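A minimal sketch of the scoring function. The chapter only fixes the two penalty levels (50% at 180 days, 90% at 2 years), so the sigmoid's steepness here is solved from those two anchor points — an assumption, not a prescribed constant:

```python
import math

def recency_decay(days, half_life=180.0, ninety_at=730.0):
    """Sigmoid decay: 50% penalty at `half_life` days, ~90% at `ninety_at`.
    k is derived so both anchor points hold exactly."""
    k = math.log(9.0) / (ninety_at - half_life)
    return 1.0 / (1.0 + math.exp(k * (days - half_life)))

def memory_score(cos_sim, days_since_created, confidence):
    # relevance x recency x confidence, as in the formula above
    return cos_sim * recency_decay(days_since_created) * confidence
```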
Memory hierarchy addresses the “too much detail” problem. A user with 10,000 memories cannot have all of them in the context window, but retrieving only top-5 may miss important context. The solution is a two-tier hierarchy:
- Tier 1: Atomic facts — individual extracted memories, stored in Qdrant, retrieved by semantic similarity.
- Tier 2: User biography — once a month, a background job summarizes all atomic facts into a “user biography” (~1,000 tokens). This biography is always included in the context window (it is always relevant) while atomic facts are used for specific query-time lookups.
The biography generation runs monthly for each active user. Cost: ~5,000 tokens in (all atomic facts) + ~1,000 tokens out = 6,000 tokens. For 1M active users: 6B tokens/month at $0.11/M = $660/month. Negligible.
GDPR deletion: the user sees a list of all their memories in a UI (sorted by recency). They can delete any memory individually or all at once. Deletion is hard: the atomic fact is removed from Qdrant, the text is removed from the metadata store (Postgres), and any derived summaries (including the user biography) that referenced this memory are flagged for regeneration. The regeneration is async: within 30 days (GDPR allows 30 days for deletion to propagate), all derived summaries are regenerated without the deleted fact. The session ID provenance link means batch deletion is possible: “delete all memories derived from session X” is a single query.
Cross-device sync uses a version vector (logical clock per user per device) to detect and resolve conflicts when the same user writes a memory from two devices simultaneously. In practice, memory writes are rare and conflicts are almost never meaningful (two devices recording the same fact from different sessions are typically resolved by deduplication). The sync protocol: on reconnect, each device sends its version vector to the server, the server replies with the delta of memories created since the device’s last known version, and the device applies them. Eventual consistency is acceptable for memory; no user notices if a memory takes 30 seconds to propagate across devices.
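The version-vector protocol can be sketched in a few lines. The memory record shape here (`device`, `counter` fields) is a hypothetical schema the chapter does not fix:

```python
def merge_version_vectors(a, b):
    """Element-wise max of two version vectors (device_id -> counter)."""
    keys = set(a) | set(b)
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in keys}

def delta_since(server_memories, device_vv):
    """Memories the device has not yet seen: each memory carries the
    (device_id, counter) at which it was written."""
    return [m for m in server_memories
            if m["counter"] > device_vv.get(m["device"], 0)]
```

On reconnect, the device sends its vector, receives `delta_since(...)`, applies it, and merges vectors — eventual consistency, which the text argues is acceptable for memory.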
Key trade-offs
- Verbatim storage vs extracted facts. Verbatim storage is auditable (you can show the user exactly what was stored) and enables GDPR-compliant deletion of specific quotes. Extracted facts are more useful for retrieval (they are normalized and deduplicated) but harder to audit. Production choice: store extracted facts, with provenance back to the session ID. This is auditable at the session level without storing every word verbatim.
- Recency wins vs relevance wins. A user who had a cat 5 years ago and a dog now: if they ask about their pet, recency-weighted retrieval surfaces the dog. But if the user asks about their childhood pet, the cat memory is the relevant one. The recency decay curve is tunable; the senior candidate sets it as a hyperparameter, not a constant.
- Per-user vector index vs shared index. A shared index (all users’ memories in one Qdrant collection) is simpler but requires filtering by user ID at query time, which is slower and less memory-efficient. Per-user namespacing (Qdrant tenant isolation) gives perfect isolation and faster queries but more operational complexity. Use per-user namespacing.
- On-device vs server-side memory. Apple and Google push memory to the device for privacy. Server-side memory enables cross-device sync and richer models but creates a centralized data store of sensitive information. The design choice depends on the threat model. A consumer app that needs sync should use server-side with strong encryption at rest and a user-controlled key.
- Memory eviction policy. What happens when a user has 100,000 memories? Keep the most recent N by creation date? The highest-confidence ones? The most recently retrieved ones (LRU-style eviction, dropping memories that have gone longest without being retrieved)? LRU-style eviction is the best answer: memories that have never been retrieved are likely to be irrelevant to future queries.
Failure modes and their mitigations
- Memory hallucination. The extractor model confidently extracts a false fact (“user is married” when the user said they were divorced). Confidence threshold filtering (> 0.7) reduces this; periodic review prompts (“I remember you mentioned X — is that still accurate?”) give the user a chance to correct. Never present a memory as a certain fact in the system prompt; frame it as “I believe you mentioned…”.
- Extraction PII leakage. Extracted memories are compressed representations of sensitive conversations; they constitute personal data under GDPR. Store them encrypted at rest with a user-specific key. Never include extracted memories in model training data without explicit consent.
- Retrieval reinforcing bias. If the assistant retrieves the same memories repeatedly, it over-applies them. A user who mentioned being vegetarian once but didn’t intend it to define every food conversation gets unsolicited vegetarian recommendations forever. Mitigation: a “memory usage” counter that tracks how many times a memory has been retrieved; memories retrieved many times have their relevance score discounted (“assumed known”) and are surfaced less aggressively over time.
- GDPR deletion cascade failure. If the deletion pipeline fails (due to a Qdrant outage), the deleted memory may still appear in the index. Mitigation: a soft-delete flag is set immediately (retrieval skips soft-deleted memories regardless of index state), followed by an async hard delete. The 30-day GDPR window provides time for retries.
What a senior candidate says that a mid-level candidate misses
The mid-level candidate describes a vector store with per-user partitions that retrieves relevant past conversation chunks. The senior candidate raises the recency × relevance scoring function as the non-trivial design problem — without the recency decay, old memories crowd out recent ones and the assistant feels like it lives in the past. They also raise the contradiction resolution problem: what happens when the user says something that directly contradicts an existing memory? The naive approach (most recent wins) is usually right but breaks for certain memory types (a user’s allergies are more durable than their mood). Expressing this as a confidence score with a type taxonomy is the senior signal.
Follow-up drills
- “A user asks ‘what do you know about me?’ — describe the end-to-end response from the retrieval system to the formatted output, and what the UX implications are.”
- “How do you prevent the assistant from inferring sensitive attributes (e.g., health conditions, political beliefs) from seemingly innocuous memories and using them in ways the user didn’t intend?”
- “The user exercises their right to erasure under GDPR. Walk me through the deletion pipeline, what is deleted, what is regenerated, and how you verify completion within 30 days.”
§127.2 Design an online-learning recommendation system
The problem
The interviewer asks: “Design a recommendation system that updates its model in real-time as users click, skip, and engage. Handle cold start for new users and items, manage the exploration/exploitation tradeoff, and make sure offline evaluation is trustworthy despite the feedback loop problem.”
What they are testing: this is the canonical recommender systems interview question, but with an online-learning twist that most candidates haven’t thought through. The hard problems are: (a) the feedback loop — the model only observes outcomes for items it recommends, so it cannot learn about items it never shows; (b) incremental updates — how do you update a model continuously without retraining from scratch? (c) counterfactual evaluation — how do you evaluate a new ranking policy without running an A/B test? These are the problems that separate production recommender engineers from researchers.
Clarifying questions
- What are we recommending? Short-form videos (TikTok-style), products, articles, songs? The engagement signal (watch time vs click vs purchase) and the item corpus characteristics (millions of items, rapidly changing) differ substantially.
- How fresh do updates need to be? Real-time (sub-second feature updates)? Micro-batch (minutes)? Daily retrain? This determines the update architecture.
- What is the cold-start definition? New user (no history), new item (just uploaded), or user × new-content-type interaction?
- What is the production serving latency target? p99 < 50 ms for ranking? p99 < 200 ms including feature retrieval?
- Is counterfactual evaluation required? This is a signal that the team is mature enough to care about offline eval validity.
- What is the exploration budget? Some fraction of recommendations must be exploratory (showing items not predicted to maximize engagement) to avoid filter bubbles and gather signal on new items.
Back-of-envelope
Item corpus: 10M items (videos or products). User base: 50M DAU. Recommendation calls: 50M × 10 recommendations/session = 500M recommendation calls per day = ~6,000 RPS.
Feature vector sizes: user embedding (64 dims, fp32) = 256 bytes; item embedding (64 dims, fp32) = 256 bytes; contextual features (time-of-day, device, recent history) = ~200 bytes. Total per recommendation: ~700 bytes. For serving latency, the model evaluation (2-layer MLP with 128-dim hidden = ~50k FLOPs) is trivial; the bottleneck is feature lookup latency (user and item embeddings from Redis or a feature store).
Incremental update throughput: 50M DAU × 10 clicks/session = 500M click events/day = ~6,000 events/sec. Each event triggers a feature update for the user and possibly the item (like-count, CTR). Feature updates must complete within the micro-batch window (1 minute) to be “fresh.”
Model update frequency: full retrain on a 50M-event corpus takes ~4 hours on a 16-GPU H100 cluster. That is too slow for a daily cycle, let alone hourly. Incremental update strategies (embedding layer updates, shallow fine-tuning on the last day’s data) can run in < 30 minutes, allowing 6 micro-cycles per day with a reasonable compute budget.
Architecture
graph LR
EVENTS["User Events\n(click · skip · watch · purchase)"] --> KAFKA["Kafka\nclick-stream topic\n~6,000 events/sec"]
KAFKA --> FEATPIPE["Feature Pipeline\nFlink: aggregate features\nper-user per-item rolling windows"]
FEATPIPE --> FEATSTORE["Feature Store\nRedis: user + item embeddings\nOnline: p99 <1ms lookup"]
KAFKA --> TRAINPIPE["Training Pipeline\nincremental: last 24h events\nRay + PyTorch"]
TRAINPIPE --> MODELREG["Model Registry\nnew embedding weights\neval gate before swap"]
MODELREG --> SERVING["Ranking Service\nembedding lookup + MLP score\ntop-k retrieval via ANN index"]
SERVING --> EXPLORE["Exploration Policy\nepsilon-greedy · Thompson sampling\ncounterfactual logging"]
EXPLORE --> RESPONSE["Ranked Results\nto client"]
KAFKA --> CFLOG["Counterfactual Log\npropensity score per shown item\nKafka + S3"]
CFLOG --> OFFLEVAL["Offline Eval\nIPS / DR estimator\nproduction policy vs candidate policy"]
style SERVING fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The counterfactual log is the evaluation infrastructure that lets you estimate a new policy’s performance before A/B testing — it is the most underbuilt component in most recommendation systems.
Feature pipeline runs in Flink with two types of features:
- User features: rolling windows (last 1h, 24h, 7d) of engagement by category, watch-time distribution, skip rate by content type. Updated on every event. Written to Redis with TTL matching the window.
- Item features: cumulative engagement statistics (CTR, average watch time, share rate), recency (age since upload), creator quality score. Updated every 5 minutes in a micro-batch Flink job.
Incremental model update uses a shallow-retrain approach. The deep embedding layers (which capture long-term preferences) are retrained daily on the full corpus. The interaction layer (which captures recent trends) is updated every hour on the last hour’s events. This splits the model into a slow-changing “base” and a fast-changing “head,” avoiding the cost of full daily retraining while keeping the ranking fresh. The architecture is the standard two-tower + interaction model: user tower and item tower produce 64-dim embeddings; a 3-layer MLP scores the concatenated user-item embedding.
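The base/head split can be sketched in PyTorch — towers as the slow-changing "base," the interaction MLP as the fast-retrained "head." Layer widths are illustrative assumptions, not the chapter's specification:

```python
import torch
import torch.nn as nn

class TwoTowerRanker(nn.Module):
    """Two-tower + interaction-MLP sketch: each tower maps raw features
    to a 64-dim embedding; the head scores the concatenation."""
    def __init__(self, n_user_feats=32, n_item_feats=32, dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(nn.Linear(n_user_feats, 128),
                                        nn.ReLU(), nn.Linear(128, dim))
        self.item_tower = nn.Sequential(nn.Linear(n_item_feats, 128),
                                        nn.ReLU(), nn.Linear(128, dim))
        self.head = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(),
                                  nn.Linear(128, 64), nn.ReLU(),
                                  nn.Linear(64, 1))

    def forward(self, user_x, item_x):
        u = self.user_tower(user_x)
        i = self.item_tower(item_x)
        return self.head(torch.cat([u, i], dim=-1)).squeeze(-1)

def freeze_base(model):
    """Hourly incremental update: freeze the towers, train only the head."""
    for tower in (model.user_tower, model.item_tower):
        for p in tower.parameters():
            p.requires_grad = False
```

The daily full retrain unfreezes everything; the hourly cycle calls `freeze_base` and fits only the head on the last hour's events.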
Exploration policy: pure exploitation (always show the highest-predicted-engagement item) creates filter bubbles and starves new items of signal. Production policies mix:
- Epsilon-greedy: with probability ε (typically 5%), show a random item from the long tail instead of the top-k. Simple, interpretable.
- Thompson sampling: maintain a posterior distribution over each item’s CTR; sample from the posterior rather than using the point estimate. Items with wide uncertainty (few impressions) are explored more. More statistically principled but harder to implement at scale.
- Upper confidence bound (UCB): score items as predicted_ctr + exploration_bonus, where exploration_bonus is proportional to uncertainty. Similar benefit to Thompson sampling.
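Epsilon-greedy, with the propensity score that the counterfactual log below requires, might look like this (a sketch; the 5% rate matches the text's default):

```python
import numpy as np

def select_with_exploration(predicted_ctr, epsilon=0.05, rng=None):
    """Pick uniformly at random with prob. epsilon, else the argmax.
    Returns (item_index, propensity), where propensity is the probability
    the chosen item is shown under this policy -- the value that must be
    logged for IPS evaluation."""
    rng = rng or np.random.default_rng()
    n = len(predicted_ctr)
    best = int(np.argmax(predicted_ctr))
    if rng.random() < epsilon:
        chosen = int(rng.integers(n))
    else:
        chosen = best
    # Uniform exploration mass, plus the exploit mass if chosen == argmax.
    propensity = epsilon / n + (1 - epsilon) * (chosen == best)
    return chosen, propensity
```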
At 6,000 RPS × 5% exploration = 300 exploratory requests/sec = ~26M exploratory recommendations/day. These must be logged with their propensity scores (the probability that each item was shown under the current policy) for counterfactual evaluation.
Counterfactual offline evaluation is the senior-level insight. The problem: you want to evaluate a new policy (Policy B) without deploying it. You can only observe outcomes (clicks, watches) for items that your current policy (Policy A) showed. Items never shown by Policy A have unknown outcomes. Naively comparing Policy B’s predicted outcomes to Policy A’s observed outcomes is biased: Policy B might score highly items that Policy A never showed (and which might be terrible or great in reality).
The Inverse Propensity Scoring (IPS) estimator corrects this bias. For each (user, item, outcome) event logged under Policy A, compute the propensity score p(item shown | Policy A, user). If you want to estimate what Policy B would have achieved, reweight each event by the importance weight w = P(item chosen by B) / p(item shown by A). Items Policy B would have shown more often than Policy A are upweighted; items Policy A showed frequently but Policy B ignores are downweighted. The result is an unbiased estimate of Policy B’s performance on historical data — without an A/B test.
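A clipped IPS estimator is only a few lines. The clip value of 10 follows the propensity-clipping trade-off in this chapter's trade-off list; it is a tunable assumption:

```python
import numpy as np

def ips_estimate(rewards, logged_propensities, target_probs, clip=10.0):
    """Clipped IPS estimate of the target policy's mean reward from logs
    collected under the logging policy. `target_probs` is P(item | Policy B)
    for each logged event; clipping trades a little bias for much lower
    variance when logged propensities are tiny."""
    w = np.minimum(np.asarray(target_probs) / np.asarray(logged_propensities),
                   clip)
    return float(np.mean(w * np.asarray(rewards)))
```

Sanity check: if Policy B equals the logging policy, all weights are 1 and the estimate is just the empirical mean reward.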
Cold start strategies:
- New user cold start: show a mix of globally popular items (safe fallback) and items with high diversity (to gather signal across categories). After 5–10 interactions, the user embedding is initialized via a warm-up MLP that maps interaction history to the shared embedding space.
- New item cold start: extract content embeddings (text title, thumbnail image, audio spectrum) using frozen CLIP/BERT encoders. Map content embeddings to the recommendation embedding space via a learned cross-modal projection. New items start with a content-based embedding and transition to an interaction-based embedding as they accumulate clicks.
- Context cold start: a known user in a new context (e.g., first time using the desktop app). Use the user’s full embedding but boost exploration weight for the first session to gather device-specific context.
Drift detection: the incremental model update rate (hourly) is calibrated to the timescale of user preference drift. Faster drift (trending topics, viral events) requires a faster update cycle; a separate “trending signal” feature that bypasses the ranker (a lightweight heuristic boost for items with rapidly increasing CTR in the last 1 hour) handles ultra-short timescales. Slower drift (user life changes, seasonal patterns) is captured in the daily full-model retrain.
Key trade-offs
- Full daily retrain vs incremental. Full retrain is reproducible, avoids catastrophic forgetting, and is simpler to debug. Incremental is fresher but can drift in unexpected ways (the model “forgets” low-engagement patterns as it overfits to recent high-engagement items). Production choice: incremental head update hourly + full deep retrain daily.
- Two-tower model vs autoregressive session model. Two-tower (user embedding × item embedding) is fast at serving time (precompute all item embeddings; dot product at query time). Autoregressive session models (GRU4Rec, SASRec) are more accurate but slower and harder to decompose for ANN retrieval. Two-tower for the retrieval stage; a small interaction MLP for the re-ranking stage.
- Exploration rate. 5% exploration is standard but not right for every context. During off-peak hours (when user satisfaction is less critical), raise to 15%. During peak (high-engagement sessions), lower to 2%. This is the “adaptive epsilon” strategy.
- Propensity clipping. IPS estimators with very small propensity scores (p(a) → 0) produce extreme weights that dominate the estimate. Solution: clip importance weights at a maximum value (e.g., 10); accept some bias in exchange for lower variance.
- Feature freshness vs consistency. If the feature pipeline updates user features every 1 minute but the model serving cache is stale for 5 minutes, the features used at serving time don’t match the features the model was trained on. This feature drift is a quiet accuracy bug. Mitigation: use the same TTL for feature store and serving cache, or compute features inline at serving time for the most critical dimensions.
Failure modes and their mitigations
- Filter bubble reinforcement. The model only shows items similar to past engagement → user engagement with diverse content drops → the model learns to show even narrower content. Mitigation: diversity constraints in the ranking step (no more than 2 items from the same creator or category in the top-10); explicit diversity metric in the eval harness.
- Feedback loop poisoning. A bot clicks on a set of items to inflate their engagement signal. The ranker promotes those items for real users. Mitigation: bot detection upstream (session velocity heuristics, IP reputation); anomaly detection on per-item CTR spikes (flag items with > 5× their 7-day average CTR for human review).
- Incremental update overfitting. Hourly incremental updates on recent data cause the model to overweight trending content and under-weight long-tail content. Mitigation: mix 20% of historical data from a fixed reservoir (random sample of past events) into every incremental training batch. This is the “experience replay” technique from reinforcement learning.
- ANN index stale. The item embedding index (HNSW or FAISS for retrieval) is rebuilt daily, but item embeddings are updated incrementally. Between rebuilds, the index may have stale embeddings for recently uploaded items. Mitigation: a “new items” buffer (items uploaded in the last 24 hours are served by a brute-force search over their content embeddings, bypassing the stale ANN index).
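The experience-replay mitigation can be sketched with a classic uniform reservoir sample (Vitter's algorithm R); the capacity and the 20% mix are the values named above, and this is a sketch, not a production buffer:

```python
import random

class ReservoirReplay:
    """Fixed-size uniform sample over all past events ever added."""
    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.seen = 0
        self.buf = []
        self.rng = random.Random(seed)

    def add(self, event):
        self.seen += 1
        if len(self.buf) < self.capacity:
            self.buf.append(event)
        else:
            j = self.rng.randrange(self.seen)   # replace with prob. cap/seen
            if j < self.capacity:
                self.buf[j] = event

    def sample(self, k):
        return self.rng.sample(self.buf, min(k, len(self.buf)))

def mixed_batch(fresh_events, replay, replay_frac=0.2):
    """Mix historical replay events into an incremental training batch so
    that replay_frac of the final batch is historical."""
    n_replay = int(len(fresh_events) * replay_frac / (1 - replay_frac))
    return fresh_events + replay.sample(n_replay)
```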
What a senior candidate says that a mid-level candidate misses
The mid-level candidate designs a recommendation system that collects clicks and retrains daily. The senior candidate immediately raises the feedback loop problem and IPS counterfactual evaluation as the mechanism to evaluate new policies without A/B testing every change. This is what separates a practitioner from a researcher: the practitioner has experienced A/B test queues being backed up for months, and has learned to filter candidates with offline eval before requesting an A/B slot.
The second differentiator is the exploration budget as a product decision, not a technical one. The exploration rate determines the filter bubble vs discovery tradeoff — and different products make different choices (a social feed optimizes for engagement, a music app should optimize for discovery). The senior candidate raises this as a requirement to clarify, not a hyperparameter to tune in isolation.
Follow-up drills
- “A creator uploads a new video. Describe its journey from upload to appearing in users’ recommendation feeds, including the cold-start handling and the time it takes for the item to get a real engagement signal.”
- “You run an IPS counterfactual evaluation and find that Policy B looks 8% better than Policy A. You deploy it. The actual A/B test shows Policy B is 2% worse. What happened?”
- “How would you handle recommendations across multiple surfaces (home feed, search, notifications) with different engagement patterns and latency budgets?”
§127.3 Design a multimodal chat backend
The problem
The interviewer asks: “Design the backend for a multimodal chat assistant that accepts text, images, audio, and video as inputs and generates text (and optionally images) as outputs. Handle streaming, multimodal safety, latency budgets across modalities, and cost attribution per modality.”
What they are testing: this is the hardest design problem in this book. The interviewer is looking for a candidate who understands that multimodality is not “add an image encoder” to a text chat backend. It is five separate encoding pipelines, a shared context window with a budget that must be managed across modalities, safety checks at each modality layer, a latency budget that audio and video encoding can easily blow, and cost accounting that must track a $/token equivalent across radically different media types.
Clarifying questions
- Which modalities in, which out? Text, image, audio, video in; text out is the baseline. Image generation out (like DALL-E) is a separate serving pipeline. Clarify whether the scope includes output generation of non-text modalities.
- What is the context window budget for each modality? A 128k-token context window is shared between text conversation history and multimodal tokens. An image is typically 256–1024 visual tokens; a 1-minute audio clip might be 750 tokens; a 30-second video at 1 fps is 30 frames × 256 tokens = 7,680 tokens. This budget math shapes the entire architecture.
- What is the latency target? Text chat: TTFT < 500 ms. With a 1080p image in the input, the image encoding step adds 100–300 ms before the LLM even starts. With a 30-second video, encoding adds 5–15 seconds. Are those latencies acceptable?
- What is the audio input path? Streaming audio (user is talking) or uploaded audio clip? Streaming requires ASR (automatic speech recognition) in the pipeline; uploaded audio can be processed offline.
- Safety per modality? Each modality needs its own safety layer (NSFW image detection, audio deepfake detection, harmful video content detection). Do they run in parallel or serial?
- Cost attribution model? How does a video input get charged? Per frame? Per second? Per visual token?
Back-of-envelope
Context window token budget:
- Text conversation history: 2,000 tokens (typical for a multi-turn chat)
- Image (1080p, 336×336 resized): ~576 visual tokens (24×24 tiles at the ViT patch size)
- Audio (1 minute at Whisper-large): ~750 tokens equivalent after ASR transcription, or ~600 mel-spectrogram tokens if processed as audio embeddings
- Video (30 seconds at 1 fps): 30 frames × 576 tokens = 17,280 tokens — nearly the entire context budget for a 32k model. Even a 128k context holds only ~3.5 minutes of video at 1 fps (about seven such 30-second clips).
Encoding latency:
- Image (336×336, ViT-L): ~50 ms on A100 GPU
- Audio (1-minute clip, Whisper-large): ~8 seconds on A100; ~2 seconds on H100 with Flash Attention
- Video (30 seconds, 1 fps, ViT-L per frame): 30 frames × 50 ms = 1.5 seconds — clearly not on the synchronous request path; must be async or use frame sampling
Modality cost:
- Text: $0.45/M tokens (self-hosted 70B)
- Image: visual tokens are billed the same as text tokens, but there are 576 per image → ~$0.0003 per image at self-hosted rates
- Audio: $0.006/minute (Whisper API pricing reference) or ~0.5× the token equivalent cost if processed as audio embeddings
- Video (30s at 1 fps): 30 × 576 = 17,280 tokens × $0.45/M = $0.008 per 30-second clip
Architecture
graph LR
CLIENT["Client\n(text + image + audio + video)"] --> GW["Multimodal Gateway\ncontent-type routing\nsize validation · rate limit"]
GW --> TENC["Text Tokenizer\n+context assembly"]
GW --> IENC["Image Encoder\nViT-L / CLIP\nA100 pool"]
GW --> AENC["Audio Encoder\nWhisper ASR → text tokens\nor mel-spec embeddings"]
GW --> VENC["Video Encoder\nframe sampling · ViT-L per frame\nasync job for long clips"]
TENC --> CTX["Context Assembler\ntoken budget manager\nunified context vector"]
IENC --> CTX
AENC --> CTX
VENC --> CTX
CTX --> SAFETY["Multimodal Safety\ntext: Llama Guard\nimage: NSFW classifier\naudio: deepfake detect\nvideo: per-frame NSFW (sampled)"]
SAFETY --> LLM["LLM Serving Fleet\nMultimodal LLM\nvLLM + TP=2 · SSE stream"]
LLM --> OSAFETY["Output Safety\ntext output scan"]
OSAFETY --> METER["Cost Meter\nper-modality token count\nKafka billing event"]
style CTX fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The context assembler is the budget manager: it enforces the token window allocation across modalities, applies frame sampling for video, and rejects inputs that would exceed limits before any expensive encoding runs.
Modality encoding pipelines are four separate GPU pools:
- Image encoder: ViT-L (Vision Transformer Large) from a CLIP model, producing 576 image tokens per image (a 24×24 grid of 14×14-pixel patches over a 336×336 input). Runs on A100s. The image is resized and normalized before encoding; aspect-ratio-preserving resizing with padding minimizes distortion. Multiple images in one request are batched together across the pool.
- Audio encoder: two options depending on use case. (a) ASR path (Whisper-large): transcribes audio to text, which is then processed as text tokens. This is cheaper and simpler but loses acoustic information (tone, emotion, non-speech sounds). (b) Audio embedding path: mel-spectrogram processed by a dedicated audio transformer that produces audio tokens. This preserves acoustic context and is necessary for tasks where tone matters (voice assistants, music queries). Default: ASR for general chat, audio embeddings for voice-assistant use cases.
- Video encoder: the most expensive modality. A 30-second video at 30 fps is 900 frames. Full encoding (ViT-L per frame) would take 900 × 50 ms = 45 seconds — totally unacceptable for an interactive chat. Solutions:
- Frame sampling: encode 1 frame per second (30 frames for 30 seconds). Visual quality loss is acceptable for most understanding tasks.
- I-frame only: encode only keyframes (I-frames in the H.264/H.265 stream). For typical video, 1 I-frame per 2–5 seconds. Reduces encoding to 6–15 frames for 30 seconds.
- Async preprocessing: for long videos (> 30 seconds), the encoding job runs asynchronously. The client receives a job ID, polls for completion, and then the completion result is used as context in a follow-up request.
- Lightweight video encoder: a purpose-built video transformer (VideoMAE, ViViT) processes the full video as a sequence, using temporal compression to reduce token count without frame sampling. Slower to encode but richer representations.
- Text tokenizer: trivial — the standard tokenizer for the underlying LLM. The only complexity is multi-turn context assembly: previous turns are included up to the remaining context budget after multimodal tokens are reserved.
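As one concrete illustration of the frame-sampling option described above, a minimal sampler can pick evenly spaced frame indices. This is a sketch; `sample_frames` is a hypothetical helper, and the 30 fps source and 1 fps target are the example figures from the text:

```python
def sample_frames(total_frames: int, source_fps: float,
                  target_fps: float = 1.0) -> list[int]:
    """Return evenly spaced frame indices approximating target_fps sampling."""
    step = max(1, round(source_fps / target_fps))
    return list(range(0, total_frames, step))

# A 30-second clip at 30 fps is 900 frames; sampling at 1 fps keeps 30.
idx = sample_frames(total_frames=900, source_fps=30)
print(len(idx))   # 30
print(idx[:3])    # [0, 30, 60]
```

The same function with `target_fps` lowered further implements the re-estimation loop the budget manager uses when a clip does not fit.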
Context budget manager is the critical orchestration piece. Before sending anything to GPU encoders, the budget manager:
- Counts input text tokens (~100 ms, CPU).
- Estimates visual token count for images and video frames (based on resolution and frame count, pre-encoding).
- If the total exceeds the context limit, applies a degradation policy: (a) reduce video to I-frames only; (b) reduce image resolution (336×336 → 168×168 → 84×84, each step quartering the visual token count, since tokens scale with area); (c) truncate conversation history (oldest turns first); (d) return HTTP 400 with a descriptive error if even the degraded inputs exceed the limit.
- If the total fits, dispatch all encoding jobs in parallel.
The parallelism is critical: image + audio encoding in parallel saves 50–300 ms vs serial. The context assembler waits for all encoding jobs to complete before constructing the final context vector and sending to the LLM.
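The degradation policy can be sketched as an ordered list of plans tried until one fits. This is a simplified model, not the real budget manager: the I-frame ratio (one per ~3 seconds) and the history-halving step are illustrative assumptions, and the image-resolution step is omitted for brevity:

```python
# Sketch of the pre-encoding degradation policy. All limits and ratios
# here are illustrative; a production budget manager would use exact
# tokenizer counts and the container's real I-frame positions.
TOK_PER_FRAME = 576

def fit_request(text_tokens: int, n_frames: int, history_tokens: int,
                context_limit: int = 32_000):
    """Try degradation steps in order; return the first plan that fits."""
    plans = [
        ("full", n_frames, history_tokens),
        # (a) I-frames only: assume roughly one I-frame per 3 seconds.
        ("iframes_only", max(1, n_frames // 3), history_tokens),
        # (c) also truncate history, oldest turns first (modeled as halving).
        ("iframes_truncated_history", max(1, n_frames // 3),
         history_tokens // 2),
    ]
    for name, frames, hist in plans:
        total = text_tokens + frames * TOK_PER_FRAME + hist
        if total <= context_limit:
            return name, total
    return "reject_400", None        # (d) even degraded inputs don't fit

# 60 sampled frames + 12k history overflows 32k at full quality,
# but fits once the video is reduced to I-frames.
print(fit_request(text_tokens=500, n_frames=60, history_tokens=12_000))
# ('iframes_only', 24020)
```

The key property is that every plan is evaluated on cheap pre-encoding estimates; no GPU time is spent before a fitting plan is found.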
Multimodal safety runs at each layer:
- Image: a CLIP-based NSFW classifier (~10 ms on A100) on every image input. Threshold is configurable; default rejects at “explicit” NSFW level, not at “suggestive” (reducing false positives).
- Audio: a deepfake/voice-clone detector on audio inputs (important for authentication or sensitive contexts). Latency: ~50 ms. An abuse classifier on the ASR transcript (same Llama Guard path as text).
- Video: NSFW classification on sampled frames (1 frame per 5 seconds, parallelized). Video deepfake detector for video calls or identity-sensitive contexts.
- Text output: same post-filter as the baseline text chatbot (§116.5).
All safety checks run in parallel with the encoding pipeline where possible. The safety check for an image runs concurrently with ViT-L encoding of the same image; if safety check fails, the LLM call is never made.
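The concurrent safety-plus-encoding pattern can be sketched with a thread pool: both jobs start together, and the encoding result is discarded if safety fails, so the LLM call never happens. `safety_check` and `encode_image` are stand-in placeholders, not a real classifier or encoder API:

```python
# Sketch: run the raw-input safety check concurrently with encoding;
# abort before the LLM call if safety rejects the input.
from concurrent.futures import ThreadPoolExecutor

def safety_check(image_bytes: bytes) -> bool:
    return b"nsfw" not in image_bytes      # placeholder classifier

def encode_image(image_bytes: bytes) -> list[int]:
    return [0] * 576                       # placeholder: 576 visual tokens

def safe_encode(image_bytes: bytes) -> list[int]:
    with ThreadPoolExecutor(max_workers=2) as pool:
        safety = pool.submit(safety_check, image_bytes)
        tokens = pool.submit(encode_image, image_bytes)
        if not safety.result():
            tokens.cancel()                # best-effort; may already be done
            raise ValueError("input rejected by safety")
        return tokens.result()

print(len(safe_encode(b"cat photo")))      # 576
```

In the failure path some encoding work may already have run, but that wasted compute is the price of the 50–100 ms saved on the (much more common) success path.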
Streaming audio with incremental ASR addresses the audio latency problem. Instead of waiting for a complete audio clip, the client streams audio chunks (100 ms segments) as the user speaks. The gateway passes each chunk to a streaming Whisper server (faster-whisper with streaming mode, or a custom CTC-based ASR). Partial transcriptions are assembled into a growing text buffer. When the user stops speaking (detected by a VAD — voice activity detection — model), the final transcript is assembled and the LLM call begins. This reduces latency from “audio duration + processing time” to “processing time of the last chunk” — roughly 200–500 ms of lag instead of 2+ seconds.
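The VAD-triggered finalization above reduces to a small loop: accumulate partial transcripts while the VAD reports speech, and finalize the moment it reports silence. This is a sketch with placeholder stubs — `is_speech` and `transcribe_chunk` stand in for a real VAD model and streaming ASR server:

```python
# Sketch of VAD-triggered transcript finalization for streaming audio.
def is_speech(chunk: bytes) -> bool:
    return len(chunk) > 0                  # placeholder VAD: empty = silence

def transcribe_chunk(chunk: bytes) -> str:
    return chunk.decode()                  # placeholder incremental ASR

def stream_transcript(chunks) -> str:
    """Accumulate partial transcripts; finalize on the first silence."""
    buffer = []
    for chunk in chunks:
        if is_speech(chunk):
            buffer.append(transcribe_chunk(chunk))
        else:
            break                          # VAD silence: start the LLM call now
    return " ".join(buffer)

print(stream_transcript([b"turn", b"on the", b"lights", b""]))
```

Because transcription happens per 100 ms chunk while the user is still speaking, only the final chunk's processing time sits between end-of-speech and the LLM call.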
Interleaved tool calls with image output: when the LLM generates a function call to a DALL-E-class image generation service (a common pattern in multimodal agents), the image is generated asynchronously, and a placeholder token is emitted in the SSE stream. When the image is ready, a separate SSE event delivers the image data. The client renders the placeholder first (a loading spinner), then replaces it with the image. This requires the gateway to understand the multimodal SSE protocol and maintain a per-request state machine.
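One way the interleaved stream could be shaped is as typed SSE events: the event names and payload fields below are a hypothetical protocol sketch, not a published spec:

```python
import json

# Hypothetical SSE event shapes for interleaving text tokens and
# asynchronously generated images in a single stream.
def sse(event: str, data: dict) -> str:
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

stream = [
    sse("token", {"text": "Here is your image: "}),
    sse("image_pending", {"placeholder_id": "img_1"}),  # client shows spinner
    sse("token", {"text": "Anything else?"}),
    sse("image_ready", {"placeholder_id": "img_1",      # replaces the spinner
                        "url": "https://example.com/img_1.png"}),
    sse("done", {}),
]
print("".join(stream))
```

The `placeholder_id` is what lets the gateway's per-request state machine match a late `image_ready` event to the spinner the client already rendered, even though text tokens kept streaming in between.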
Cost attribution per modality tracks inputs separately:
- Text tokens: counted from the tokenizer output.
- Image tokens: 576 tokens per full-resolution image, scaled proportionally for lower resolutions.
- Audio tokens: tokens equivalent to the ASR transcript length (fair pricing for the compute used).
- Video tokens: frame count × tokens per frame, capped at the max context window budget.
Each modality’s token count is written to the Kafka billing event separately. The billing dashboard shows monthly spend broken down by modality: “Team A spent $12,000 total: $8,000 text, $3,000 image, $1,000 video.” This enables per-modality cost optimization.
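A per-modality billing event as described above might look like the following sketch; the field layout and rate are illustrative, and in production the JSON would be published to the Kafka billing topic rather than returned:

```python
import json
import time

def billing_event(request_id: str, team: str, tokens: dict[str, int],
                  rate_per_m: float = 0.45) -> str:
    """Serialize one request's per-modality token usage as a billing event."""
    total = sum(tokens.values())
    return json.dumps({
        "request_id": request_id,
        "team": team,
        "tokens_by_modality": tokens,        # kept separate per modality
        "total_tokens": total,
        "estimated_cost_usd": round(total * rate_per_m / 1e6, 6),
        "ts": int(time.time()),
    })

evt = billing_event("req-42", "team-a",
                    {"text": 2_000, "image": 576, "video": 17_280})
print(evt)
```

Keeping `tokens_by_modality` unaggregated in the event is what makes the per-modality dashboard breakdown possible downstream without reprocessing requests.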
Key trade-offs
- Shared context window vs modality-specific context. A shared token budget (all modalities compete for the same 128k context) is simple to implement and is how production multimodal LLMs (GPT-4V, Gemini 1.5) work. Modality-specific context (image tokens in a separate memory, text in another) is more complex but allows longer effective context for each modality. The shared budget is the correct first implementation.
- ASR path vs audio embedding path. ASR is cheap, fast, and produces tokens the LLM was trained on. Audio embeddings preserve acoustic information but require a multimodal LLM trained to consume both. For most chat use cases, ASR is sufficient and much simpler.
- Frame sampling vs full-video encoding. Frame sampling loses temporal information (motion, fast changes). For understanding tasks (describe what’s in this video), 1 fps is usually sufficient. For action recognition or fine-grained temporal analysis, you need either more frames or a temporal video encoder. Start with frame sampling; add temporal encoding only when accuracy on temporal queries is demonstrably poor.
- Safety in series vs parallel. Running safety checks in parallel with encoding (not waiting for encoding to finish before starting safety) saves 50–100 ms but means the LLM may start encoding a modality that safety will later reject. Mitigation: run safety checks on raw input (image before encoding, audio before ASR) — they are cheaper and faster on raw data anyway. If safety passes the raw input, proceed with encoding; if it fails, abort immediately.
- Async video vs synchronous video. Async video (clip is submitted as a job, result is polled or delivered via webhook) is necessary for clips longer than ~15 seconds. Synchronous video is acceptable for short clips (< 15 seconds, < 10 frames after sampling). The threshold is a product decision that depends on acceptable user wait time.
Failure modes and their mitigations
- Audio encoding GPU pool exhaustion. Audio is slow; at peak, the audio encoder pool is saturated and requests queue. Mitigation: audio requests are the most latency-sensitive and the most expensive; autoscale the audio encoder pool aggressively on audio:queue_depth. Alert at depth > 20 requests.
- Context budget overflow at encoding time. A user uploads a 10-minute video. At 1 fps that is 600 frames × 576 tokens = 345,600 tokens — far beyond a 128k window, which holds only ~222 frames (~3.7 minutes) at full image resolution. Mitigation: the context manager re-estimates at lower frame rates until it fits; if even I-frames don’t fit, return HTTP 400 with a helpful message (“Video exceeds context limit; try a clip shorter than 3 minutes”).
- Safety false positives on image. An NSFW classifier that flags medical diagrams, art, or news photography frustrates users. Mitigation: multi-classifier cascade (fast binary NSFW classifier first; if flagged, a more accurate classifier second); context-aware safety (if the user is a verified medical professional, adjust the safety threshold).
- Streaming ASR transcription latency spike. The ASR server is slow for long or noisy audio clips, adding multi-second latency to the LLM start. Mitigation: a parallel path — begin the LLM call with partial ASR transcript while ASR continues; the LLM generates a prefix response that can be extended when the full transcript arrives. This is complex but reduces perceived latency.
- Multimodal output image generation failure. If the DALL-E-class image generation service is unavailable, the LLM’s tool call fails. Mitigation: the gateway handles the tool failure gracefully — it sends the LLM an error observation (“image generation failed”) and the LLM produces a text-only response.
What a senior candidate says that a mid-level candidate misses
The mid-level candidate adds an image encoder to their existing text chatbot. The senior candidate immediately identifies the context window budget problem as the central design challenge: different modalities consume context at wildly different rates, and the system must make intelligent tradeoffs (lower frame rate, lower image resolution, shorter history) before encoding begins — not after. The budget manager running on pre-encoding estimates is the correct architecture; discovering the context overflow after spending 1.5 seconds encoding video frames is a waste.
The second differentiator is streaming audio with VAD-triggered LLM calls. Most candidates describe uploading an audio clip and waiting for ASR to complete. The senior candidate describes the streaming pipeline: VAD detects speech boundaries, incremental transcription accumulates tokens, the LLM is started as soon as the VAD detects silence. This is how production voice assistants (Siri, Google Assistant, Alexa) work, and knowing it signals real production experience.
Follow-up drills
- “A user sends a 10-minute video. Walk through the full request lifecycle from upload to first streamed token, including all the decision points where the system might fail or degrade gracefully.”
- “How do you handle a user asking a question that is answered by a specific 3-second segment of a 10-minute video? How do you surface the timestamp to the user in your response?”
- “Design the cost attribution system for a team that sends mostly video inputs. What metrics do you surface so the team can understand and optimize their multimodal costs?”
Read it yourself
- The MemGPT paper (Packer et al., 2023) — the original framing of tiered memory for LLM agents; the two-tier hierarchy in §127.1 is directly inspired by it.
- The Zep memory store (getzep.com) and Mem0 documentation — production implementations of conversational memory extraction; reading their API design reveals the practitioner tradeoffs.
- Schnabel et al., “Recommendations as Treatments: Debiasing Learning and Evaluation” (ICML 2016) — the foundational IPS paper for counterfactual recommendation evaluation.
- Joachims et al., “Unbiased Learning-to-Rank with Biased Feedback” (WSDM 2017) — the companion paper on position bias correction, which pairs with IPS in every production ranker.
- The TikTok recommendation system technical blog (no single paper; see the ACM RecSys 2021 keynote) — the best public description of online learning at recommendation scale.
- OpenAI’s GPT-4V system card — the only public detailed discussion of multimodal context window budgeting and safety for vision-language models.
- The Whisper paper (Radford et al., 2022) — the underlying ASR architecture; the streaming capabilities of faster-whisper are built on top of this.
- The VideoMAE paper (Tong et al., 2022) — the state-of-the-art efficient video encoder; the temporal compression approach discussed in §127.3 is from this line of work.
- GDPR Article 17 (“Right to Erasure”) and the UK ICO guidance on AI and data retention — the regulatory context for the deletion design in §127.1.
Practice
- Design the memory extraction prompt for a system that needs to extract facts from a multi-turn conversation where the user discusses both work (engineering manager at Stripe) and personal topics (dog named Mochi, planning a trip to Japan). What JSON schema would you use for the extracted facts?
- For the online-learning recommender, compute the IPS weight for an item that your current policy shows with 10% probability, and that the candidate policy would show with 30% probability. The observed click outcome was 1. What is the weighted outcome used in the IPS estimator?
- Estimate the daily token budget consumption for a multimodal chat assistant serving 1M DAU, where 40% of users upload one image per session (average 576 visual tokens) and 10% send a 30-second audio clip. Total token budget per day? At self-hosted 70B rates, what is the monthly cost?
- A user sends a video that the context manager estimates at 15,000 visual tokens, but the conversation history is already 12,000 text tokens in a 32k context window. Describe the degradation policy choices and their quality implications.
- Stretch: Design the streaming SSE protocol for a multimodal chat assistant that can interleave text tokens and generated images in the same stream. Specify the event types, their payloads, and how the client renders partial content while waiting for image generation to complete.