Design a real-time recommendation system
“The LLM revolution didn’t kill the two-tower model. It just made it pair well with one.”
Recommendations are the oldest large-scale ML product and still the most economically important. Every senior ML systems interview loop asks some version of “design a recommender” — for videos, for products, for news, for jobs. The interesting parts are not the model architecture (two-tower retrieval plus a ranker is a well-known template) but the system: feature stores, online learning loops, cold-start handling, and the interplay between the retrieval layer and the ranking layer under tight latency budgets.
This chapter works the canonical version: a real-time recommendation system for a consumer feed — videos, posts, or products — where the interesting axis is the serving architecture, not the modeling choice. The LLM era has added new components (LLM-as-ranker, LLM-augmented retrieval), but the skeleton is unchanged.
Outline:
- Clarify: what “real-time” means and why.
- Estimate: from users to QPS to feature store size.
- High-level architecture — the two-tower + ranker split.
- Drill 1: retrieval with the two-tower model.
- Drill 2: the ranking layer and feature store.
- Online learning and the feedback loop.
- The cold-start problem.
- Where LLMs do and do not fit.
- Evaluation, metrics, and drift.
- Tradeoffs volunteered.
- The mental model.
119.1 Clarify: what “real-time” means and why
The clarify phase for recommendation problems has its own shape.
1. What’s the product surface? Interviewer says: a video feed, one video at a time, for a social platform. Each impression is a (user, video) pair; the model ranks candidates and the top one is shown.
2. What does “real-time” mean? Three possible answers, all different:
- Real-time serving: the ranking happens at request time with current features. Always true for a feed.
- Real-time features: the features reflect events within the last few seconds (user watched a video 2 s ago, it must influence the next recommendation).
- Real-time learning: the model updates from new interactions within minutes (not nightly).
The interviewer says all three matter. Features must reflect events within 5 seconds; the model must update at least hourly.
3. What’s the scale? 50M DAU, ~100 videos watched per user per day, so ~5B impressions/day. Peak ~250k impressions/sec.
4. What’s the latency target? End-to-end feed load p95 < 150 ms. Each impression must be ranked and returned to the client within that budget, including retrieval + ranking + feature fetch.
5. What’s the candidate corpus? ~100M videos, growing at ~1M/day. Each video has metadata (creator, tags, duration), content embeddings, and interaction features (view count, watch time histogram).
6. What’s the optimization target? Interviewer says total watch time, with secondary objectives on diversity and fairness (prevent runaway engagement on any single creator or topic). This is a multi-objective problem, which affects the ranker design.
119.2 Estimate: from users to QPS to feature store size
The chain:
- DAU: 50M.
- Impressions per user per day: 100.
- Total impressions/day: 5B.
- Average impression rate: 5B / 86,400 ≈ 58k/sec.
- Peak (×4): ~250k/sec.
- Candidates per ranking: typically retrieval returns 500–2000 candidates per impression.
- Ranker scores: 250k impressions/sec × ~1000 candidates ≈ 250M ranker scores per second at peak.
At 250M scores/sec, the ranker must be very cheap per score. Two options: (a) a small neural ranker at ~1 ms per candidate on CPU — which at 250M scores/sec would need an enormous fleet — or (b) a tree ensemble (LightGBM/XGBoost) at ~0.1 ms per instance. Most production rankers are GBDTs for this exact reason. A modern trend: a small cross-feature neural model on a cheap accelerator (T4 or A10) for the top 500 candidates, with the GBDT covering the broader funnel.
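The chain above can be checked with a few lines of arithmetic. All inputs restate the chapter’s figures; the text rounds the peak up to ~250k/sec.

```python
# Capacity arithmetic for the estimate chain above. Inputs restate the
# text's figures; the text rounds the derived numbers up slightly.
DAU = 50_000_000
IMPRESSIONS_PER_USER_PER_DAY = 100
PEAK_FACTOR = 4
CANDIDATES_PER_IMPRESSION = 1_000
GBDT_SCORES_PER_SEC_PER_CORE = 10_000   # 0.1 ms per score

impressions_per_day = DAU * IMPRESSIONS_PER_USER_PER_DAY    # 5B
avg_qps = impressions_per_day / 86_400                      # ~58k/sec
peak_qps = avg_qps * PEAK_FACTOR                            # ~232k/sec (text: ~250k)
peak_scores = peak_qps * CANDIDATES_PER_IMPRESSION          # ~232M/sec (text: ~250M)
ranker_cores = peak_scores / GBDT_SCORES_PER_SEC_PER_CORE   # ~23k cores (text: ~25k)

print(f"{avg_qps:,.0f} avg impressions/sec, {ranker_cores:,.0f} GBDT cores at peak")
```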
Feature store.
- User features: 50M users × ~200 features × 4 bytes ≈ 40 GB. Fits in RAM (Redis or an embedded in-memory store).
- Video features: 100M videos × ~500 features × 4 bytes ≈ 200 GB. Too big for a single Redis node; sharded across 4–8 nodes.
- Interaction features (rolling counters): per-video watch counts in different time windows (1h, 24h, 7d). Another ~50 GB.
- Real-time features: last-watched video, recent impression list, current session state. Session store, ~100 GB for 50M active sessions at ~2 KB each.
- Feature store read QPS: 250k impressions × (1 user lookup + ~500 candidate video lookups) = ~125M feature reads/sec at peak. That’s high. Aggressive caching and fan-out optimizations are required.
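The sizing figures follow the same back-of-envelope pattern (decimal gigabytes, figures from the list above):

```python
# Feature-store sizing from the per-entity figures above (decimal GB).
user_features_gb  = 50_000_000 * 200 * 4 / 1e9      # ~40 GB
video_features_gb = 100_000_000 * 500 * 4 / 1e9     # ~200 GB
session_store_gb  = 50_000_000 * 2_000 / 1e9        # ~100 GB at ~2 KB/session

peak_impressions = 250_000
reads_per_impression = 1 + 500                      # 1 user lookup + ~500 candidates
feature_reads_per_sec = peak_impressions * reads_per_impression  # ~125M/sec
```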
GPU/CPU fleet for retrieval + ranking.
- Retrieval (two-tower ANN): a well-tuned HNSW answers a top-1000 query over 100M vectors in ~5 ms on one core, i.e., ~200 QPS per core. At 250k/sec that is on the order of 1,250 cores — a few dozen 32-core nodes, fewer with batching, a smaller ef_search, or quantized vectors.
- Ranking: GBDT inference at 0.1 ms per candidate per core ≈ 10k ranks/sec/core. 250M scores/sec / 10k = 25k cores. That’s a lot — in practice, rankers run on many replicas across many nodes, and the cost is amortized across multiple surfaces.
Monthly cost: feature store ~$5k (RAM-heavy), retrieval ~$10k, ranking CPU fleet ~$30k, online learning pipeline ~$5k, telemetry ~$5k. Total: ~$55k/month for compute, small compared to the chatbot or RAG designs.
119.3 High-level architecture — the two-tower + ranker split
[ client (feed UI) ]
|
v
[ feed service (gateway + auth) ]
|
v
[ request orchestrator ]
|
v (parallel fan-out)
+------------------------------+
| |
v v
[ retrieval layer ] [ feature fetcher ]
- two-tower ANN - user features (Redis)
- candidate sources: - session state (Redis)
* personalized ANN - real-time counters
* trending
* social graph
* recency
- returns ~1000 candidates
| |
+--------------+---------------+
|
v
[ feature hydration ]
- batch fetch video features (500 × N_sources)
- join user features
- compute cross-features (dot products, etc.)
|
v
[ ranker ]
- GBDT (first pass, all candidates)
- neural ranker (top 200)
- multi-objective score
|
v
[ diversification + business rules ]
- creator cap, topic diversity
- age-restricted filtering
- blocklists, ads integration
|
v
[ top-1 (or top-k) result ]
|
v
[ response to client + impression log ]
|
v
[ event stream (Kafka) ]
| | |
v v v
[ feature [ online [ training
store learning data
update ] update ] warehouse ]
The feedback loop (impression log → feature store → online learning) is what makes this “real-time” — without it, a recommendation system is just a batch ranker served fast.
Technologies:
- Feed service: Go or Java service behind the standard gateway.
- Two-tower: trained offline, served via a custom inference service with HNSW for the item tower’s vectors.
- Feature store: Feast or Tecton for offline/online parity (Chapter 91), backed by Redis for online, BigQuery/Snowflake for offline.
- GBDT ranker: LightGBM or XGBoost, served via an internal prediction service. An alternative: a small PyTorch neural ranker on CPU.
- Online learning: periodic mini-batch updates to the ranker on a 15-minute cadence; full retrain nightly.
- Event stream: Kafka with exactly-once semantics, partitioned by user_id for ordering.
Interviewer says “drill into retrieval.”
119.4 Drill 1: retrieval with the two-tower model
The job. Given a user ID, return ~1000 candidate videos likely to be watched, across multiple candidate sources, within 30 ms p95.
The two-tower model. Two neural networks: one that encodes the user into a vector (the “user tower”) and one that encodes the video into a vector (the “item tower”). At training time, both are jointly optimized with a contrastive loss (in-batch negatives) to make the user vector close to the videos the user watched and far from the ones they didn’t. At serving time:
- The user vector is computed at request time from the current user features (recent interactions, session state, profile).
- The item vectors are precomputed for all 100M videos and stored in an HNSW index.
- An ANN search returns the top-K nearest videos to the user vector.
This is the standard architecture from YouTube’s “Deep Neural Networks for YouTube Recommendations” paper (Covington et al., 2016) and has been the dominant retrieval pattern ever since.
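A minimal sketch of the training objective. Numpy stand-ins replace the real MLP towers, and the dimensions and temperature are illustrative, not production values:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32  # toy; production retrieval towers use ~256 dims

# Stand-ins for the two towers: one random projection each, not real MLPs.
W_user = rng.normal(size=(100, DIM)) / 10   # user-feature dim 100 -> DIM
W_item = rng.normal(size=(50, DIM)) / 7     # item-feature dim 50 -> DIM

def user_tower(x):  # x: (batch, 100)
    u = np.tanh(x @ W_user)
    return u / np.linalg.norm(u, axis=1, keepdims=True)

def item_tower(x):  # x: (batch, 50)
    v = np.tanh(x @ W_item)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def in_batch_softmax_loss(u, v, temperature=0.05):
    """Contrastive loss with in-batch negatives: row i of u should score
    highest against row i of v; the other rows of v act as negatives."""
    logits = (u @ v.T) / temperature                 # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

batch = 8
u = user_tower(rng.normal(size=(batch, 100)))
v = item_tower(rng.normal(size=(batch, 50)))
loss = in_batch_softmax_loss(u, v)
```

At serving time only `user_tower` runs per request; the item vectors are the precomputed output of `item_tower` over the whole corpus.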
Why two-tower, not a deep cross-network retriever? Because the item tower’s vectors can be precomputed and indexed. A cross-network scores every (user, item) pair jointly, so there is nothing to index — and scoring 100M candidates at request time is impossible.
User tower inference. A small MLP, runs in <5 ms on CPU per user request. Features include aggregated watch history, demographic (where available), session context, and the last 50 impressions.
Item tower refresh. The item vectors are recomputed when the model changes (maybe weekly) and when item features shift (new view counts, new tags). A hybrid approach: recompute on item event, batched every hour. The HNSW index supports incremental insertion, so new videos are added immediately after their first embedding.
Candidate sources. Personalized ANN is one of several candidate generators. Others:
- Trending: top-N globally or per region, refreshed every few minutes.
- Social graph: videos watched by the user’s follows.
- Recency: recent uploads from creators the user interacts with.
- Query-like: if the user searched recently, videos matching the query.
Each source returns ~100–500 candidates. The orchestrator merges and deduplicates. Multi-source retrieval is important for diversity and for handling cold users (when personalization is weak, trending and social sources carry the weight).
ANN parameters. For 100M vectors at 256 dim (typical for retrieval two-towers), raw size ~100 GB. HNSW with ef_search=64 gives ~95% recall at ~5 ms per query on a single core. The index is sharded across 4 nodes for memory headroom, with replicas layered on top for throughput and redundancy.
Filtering. Candidates must be filtered for eligibility: not already watched recently, not blocked, not age-restricted, respects locality. These filters are applied post-retrieval because they’re dynamic. Post-filtering can destroy recall if filters are selective; for strict filters (e.g., age), the candidate set is pre-partitioned so the ANN only searches within the eligible subset.
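The post-filtering recall problem is easy to demonstrate. A brute-force scan stands in for the HNSW search here; the corpus size and the 10%-eligible filter are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, DIM, K = 10_000, 32, 100          # toy corpus; production: 100M vectors, 256 dims

items = rng.normal(size=(N, DIM))
items /= np.linalg.norm(items, axis=1, keepdims=True)
eligible = rng.random(N) > 0.9       # a strict filter keeping ~10% of items

query = rng.normal(size=DIM)
query /= np.linalg.norm(query)

# Brute-force stand-in for the ANN search (production: HNSW, ef_search~64).
scores = items @ query
top_k = np.argsort(-scores)[:K]

post_filtered = top_k[eligible[top_k]]          # retrieve, then filter
print(f"{len(post_filtered)}/{K} candidates survive post-filtering")

# Pre-partitioned alternative: search only the eligible subset.
elig_idx = np.flatnonzero(eligible)
elig_scores = items[elig_idx] @ query
pre_partitioned = elig_idx[np.argsort(-elig_scores)[:K]]
```

With a ~10%-selective filter, post-filtering leaves roughly a tenth of the candidates, while pre-partitioning keeps the candidate set full by construction.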
119.5 Drill 2: the ranking layer and feature store
The ranker. After retrieval returns ~1000 candidates, the ranker scores them and picks the top-1 (or top-K for the feed). Ranker architecture is typically two stages:
- First-pass GBDT over all 1000 candidates. ~50–100 features per candidate. Scores fast — 0.1 ms per candidate on one CPU core. Outputs a coarse ranking.
- Second-pass neural model over the top-200. Deeper features, cross-features (user × item interactions), attention over recent history. Runs at 2–5 ms per batch on CPU or small GPU.
The two-pass approach lets the system spend compute where it matters — the top of the ranking.
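The funnel shape can be expressed in a few lines. Linear scorers stand in for the GBDT and the neural model; the candidate counts are the ones from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
N_CANDIDATES, N_FEATURES, TOP_COARSE, TOP_K = 1_000, 50, 200, 10

w_cheap = rng.normal(size=N_FEATURES)   # stand-in for the first-pass GBDT
w_rich  = rng.normal(size=N_FEATURES)   # stand-in for the neural ranker

features = rng.normal(size=(N_CANDIDATES, N_FEATURES))

# First pass: score all 1000 candidates with the cheap model.
coarse = features @ w_cheap
shortlist = np.argsort(-coarse)[:TOP_COARSE]

# Second pass: spend the expensive model only on the top 200.
fine = features[shortlist] @ w_rich
ranked = shortlist[np.argsort(-fine)][:TOP_K]
```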
Features used in the ranker:
| Type | Examples |
|---|---|
| User features | age bucket, region, language, recent watch categories |
| Item features | category, duration, creator, freshness, quality score |
| User-item cross | click-through rate for this user-item category, past watch history with this creator |
| Context | time of day, device, connection speed |
| Real-time | last 5 watched items, session length, current scroll depth |
The feature store must serve all of these within the latency budget.
Feature store architecture. Feast-style: dual offline/online stores that share a feature definition. Offline store (BigQuery) is used for training-data generation. Online store (Redis) is used for serving. The critical property is train/serve consistency: a feature’s value at training time must match its value if we had queried it at that moment in production. Without this, the trained model makes decisions on stale or shifted features.
Online store latency. Redis cluster, 5–8 nodes at 256 GB RAM each. Reads at <1 ms per key. Pipelined batch reads (mget) for fan-out. For 500 candidates × 10 features each = 5000 feature reads per impression, batched into ~5 Redis mget calls, total ~3 ms.
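The batching pattern, with a dict stub standing in for the Redis cluster (the key names are hypothetical, not a real schema):

```python
# Dict-backed stub standing in for Redis, to show the key layout and how
# per-candidate reads collapse into a handful of batched calls.
store = {f"video:{vid}:f{j}": float(j) for vid in range(500) for j in range(10)}

def mget(keys):                # stand-in for redis MGET / a pipelined read
    return [store.get(k) for k in keys]

candidates = list(range(500))
keys = [f"video:{vid}:f{j}" for vid in candidates for j in range(10)]  # 5000 keys

BATCH = 1000                   # ~5 mget calls instead of 5000 single GETs
batches = [keys[i:i + BATCH] for i in range(0, len(keys), BATCH)]
values = [v for b in batches for v in mget(b)]
```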
Real-time features. The trickiest. A feature like “user’s last watched video ID” must reflect events within seconds. Pattern: the event stream (Kafka) is consumed by a stateful processor (Flink or Kafka Streams) that updates Redis keys as events arrive. Median update latency: ~1 second. Ordering is preserved by partitioning by user_id.
Feature freshness SLI: p95 of (feature_update_time - event_time). SLO: <5 seconds.
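A toy version of the update path and its SLI. Plain Python stands in for the Flink job, and the timestamps, lag, and key names are invented for illustration:

```python
# Simulated stream consumer: events arrive in partition order (per user);
# each update writes the online store and records lag for the freshness SLI.
events = [  # (event_time_s, user_id, video_id)
    (100.0, "u1", "v9"), (100.5, "u2", "v3"), (101.0, "u1", "v4"),
]

online_store = {}
lags = []

def consume(event_time, user_id, video_id, processing_lag=0.8):
    update_time = event_time + processing_lag   # simulated pipeline delay
    online_store[f"user:{user_id}:last_video"] = video_id
    lags.append(update_time - event_time)

for e in events:
    consume(*e)

# Crude p95 for the toy sample; production uses a streaming histogram.
p95_lag = sorted(lags)[int(0.95 * (len(lags) - 1))]
assert p95_lag < 5.0                                  # the SLO above
assert online_store["user:u1:last_video"] == "v4"     # later event wins
```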
The offline/online skew problem. A feature computed offline with a 1-day rolling window might not match the online version computed with a 24-hour sliding window because of subtly different semantics (e.g., calendar day vs exact 24h). This mismatch destroys model quality silently. Mitigation: feature definitions are code, not SQL, and the same code path is used for offline and online computation. Feast and Tecton are both built around this principle.
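The parity principle in its smallest form: one feature function shared by both paths, with an exact 24-hour window rather than a calendar day (the history values here are illustrative):

```python
# One feature definition used by both the offline (training) and online
# (serving) paths -- the train/serve consistency principle above.
WINDOW_S = 24 * 3600

def watch_count_24h(watch_times, now):
    """Watches in the exact trailing 24h window ending at `now` (seconds)."""
    return sum(1 for t in watch_times if now - WINDOW_S < t <= now)

history = [10.0, 50_000.0, 90_000.0]   # toy watch timestamps

# Offline: replay history at a past timestamp to build a training row.
offline_value = watch_count_24h(history, now=90_000.0)
# Online: the same code path at serving time.
online_value = watch_count_24h(history, now=90_000.0)
assert offline_value == online_value == 2   # 10.0 falls outside the window
```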
119.6 Online learning and the feedback loop
The model must update from new interactions within minutes — 50M users × 100 impressions × implicit feedback (watched / skipped / watched-duration) is a lot of signal. A nightly retrain wastes most of it.
Streaming updates. A mini-batch of new (impression, outcome) pairs is written to a training queue every few minutes. A dedicated training job consumes the queue, updates the ranker, evaluates on a holdout, and deploys if the eval passes.
Update cadence:
- User tower: updated every few hours with a lightweight fine-tune.
- Item tower: recomputed when embeddings shift; items get new vectors hourly.
- GBDT ranker: warm-restart training every 15–60 minutes, using only the last hour of data.
- Full retrain: nightly, on the full week of data.
Training pipeline. Spark or Flink for feature extraction, TorchRec or TensorFlow for model training, Ray for orchestration. The training data warehouse is the source of truth; the online stream is the incremental signal.
Counterfactual logging. Ranking models suffer from the “bandit problem”: the model only learns about items it showed, not items it didn’t. To fix this, production logs include the full ranked list with impression weights, so counterfactual training (inverse propensity scoring) can correct for selection bias.
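A toy demonstration of why the logged propensities matter. In this simulated log the items the policy favors also get watched more, which is exactly the selection bias IPS corrects; the rates and distributions are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Propensity: probability the logged policy put the item in this slot.
propensity = rng.uniform(0.1, 0.9, size=n)
shown = rng.random(n) < propensity
# Favored items also get watched more -> naive averages are biased upward.
true_rate = 0.5 * propensity
reward = np.where(shown, (rng.random(n) < true_rate).astype(float), 0.0)

naive = reward[shown].mean()          # overweights high-propensity items
ips = np.mean(reward / propensity)    # reweights each shown impression by 1/p

# True average watch rate over all impressions is E[0.5 * p] = 0.25.
print(f"naive={naive:.3f}  ips={ips:.3f}  truth=0.250")
```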
The exploration tax. A fraction of traffic (1–5%) is deliberately served diverse or random items, to collect training data for the bandit update. This costs short-term engagement for long-term model quality.
119.7 The cold-start problem
Two flavors: cold users and cold items.
Cold users. A new user has no history. The user tower’s input features are mostly zero. The model falls back to demographic features and default priors. Mitigation strategies:
- Onboarding signal: ask the user to pick 3 topics during signup.
- Trending + social: weight trending and social graph sources higher for new users; weight personalized ANN lower.
- Rapid adaptation: update the user tower after each impression for the first session. A “fast learner” model specifically for users with <100 impressions.
- Exploration: serve more diverse content to cold users to learn their preferences faster.
Cold items. A new video has no interaction history. The item tower has only content features to work with (title, tags, creator, visual embeddings from a content model). Mitigation:
- Content-based initial embeddings: use a pre-trained visual encoder (CLIP-style) to produce the first item vector from the video itself, not from interactions.
- Creator features: inherit some signal from the creator’s historical engagement.
- Explicit new-content boost: new items get a temporary boost in the ranker to accumulate signal faster.
- Multi-armed bandit allocation: new items are allocated impressions via UCB or Thompson sampling until they have enough signal.
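A minimal Thompson-sampling allocator for three new items. The true watch rates are of course unknown to the real system; the values here are invented for the simulation:

```python
import numpy as np

rng = np.random.default_rng(4)

# Beta posterior per new item over its watch rate; each impression goes to
# the item with the highest sampled rate, so uncertain items keep being tried.
true_rates = np.array([0.05, 0.15, 0.30])   # hidden from the allocator
alpha = np.ones(3)                          # Beta(1, 1) priors
beta = np.ones(3)
pulls = np.zeros(3, dtype=int)

for _ in range(3_000):
    sampled = rng.beta(alpha, beta)         # one posterior draw per item
    item = int(np.argmax(sampled))
    watched = rng.random() < true_rates[item]
    alpha[item] += watched                  # posterior update
    beta[item] += 1 - watched
    pulls[item] += 1
```

Over a few thousand impressions the allocation concentrates on the genuinely best item while still occasionally exploring the others.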
Why this matters. Without explicit cold-start handling, new items never get shown, and the catalog stagnates. Every senior interviewer asks about cold items because candidates who’ve only read the papers miss it.
119.8 Where LLMs do and do not fit
Two places LLMs show up in modern recommenders, and two places they don’t.
Where they fit:
- Semantic item encoders. Instead of training an item tower from scratch, use a pretrained text or visual encoder (such as CLIP or a frozen LLM embedding) to produce high-quality item vectors. This is especially powerful for cold items and for heterogeneous content. The item tower becomes a thin adapter on top of the pretrained encoder.
- LLM-as-ranker for the top-K. After retrieval returns ~1000 candidates and the first-pass ranker narrows them to the top 20, a small LLM can rerank with explicit reasoning over user history and item descriptions. It costs ~200 ms per impression, so it is viable only at the very top of the funnel, never for every candidate. The gain is modest but measurable on specific surfaces (news, long-form content). For endless-scroll video feeds, the latency cost usually isn’t worth it.
Where they don’t fit:
- As a primary retriever at scale. Running an LLM over 100M candidate items per impression is economically impossible. Retrieval must remain an ANN over compact vectors.
- As a replacement for the GBDT. GBDTs consistently beat neural models on tabular features in production recommenders, and the gap has been stable for years. LLMs don’t close it.
The senior candidate’s summary: “LLMs are adjuncts to the classical recsys stack, not replacements. They add signal at the edges — cold-start items, semantic diversity, explanation generation — but the core of retrieval + ranking stays as it’s been for a decade.”
119.9 Evaluation, metrics, and drift
Offline metrics: AUC, logloss, NDCG@k on a held-out day of impressions.
Online metrics: the only ones that matter in production.
- Watch time per session — primary.
- CTR (click-through rate) — secondary; watch out for engagement without depth.
- Completion rate — did users finish the videos?
- Diversity — entropy across topics and creators shown.
- Retention — did users come back next day, next week, next month?
- Fairness — impression share across creator segments.
A/B testing infrastructure. Every model change is A/B tested with ~5% of traffic for 1–2 weeks before rollout. Minimum detectable effect calculated per metric; noisy metrics (retention) need larger samples. This is expensive but non-negotiable — shipping a worse model by accident is the single biggest risk.
Drift detection. The model’s prediction distribution, the feature distributions, and the realized engagement distribution are all monitored daily. Drift alarms fire when distributions shift more than expected. Drift is often caused by: a bug in the feature pipeline, a seasonal shift, an adversarial attack on engagement, or a change in the user base.
Feedback loops. The biggest danger. If the model learns “users watch clickbait → show more clickbait,” the system degrades over time even as metrics look fine short-term. Mitigation: holdout cohorts that receive less-personalized content; cross-checking with surveys and delayed metrics (30-day retention).
119.10 Tradeoffs volunteered
- Two-tower vs cross-network retrieval: two-tower for scale; cross-network if the candidate pool is small enough (<1M).
- GBDT vs neural ranker: GBDT for tabular features and throughput; neural ranker for embedding-heavy features and the top of the funnel.
- Feast vs Tecton vs custom feature store: Feast for open source and flexibility; Tecton for managed operations; custom if there’s a strong reason.
- Flink vs Kafka Streams: Flink for complex state and large aggregations; Kafka Streams for simpler cases.
- Online learning cadence: every 15 minutes for responsive models; hourly if instability costs more than freshness gains.
- Exploration fraction: 1% for mature models, 5–10% for early-stage, up to 20% for cold starts.
- LLM reranker for top-20 vs classical neural ranker: LLM if the surface cares about reasoning and can afford the latency; classical otherwise.
- Per-user personalization vs cohort-based: per-user for mature users; cohort (new-user clusters) for cold users.
119.11 The mental model
Eight points to take into Chapter 120:
- Two-tower retrieval + cross-feature ranker is still the template. LLMs augment, not replace.
- The feature store is the hinge. Train/serve consistency is the make-or-break property.
- Real-time features need a stateful stream processor. Flink or equivalent, partitioned by user_id, with sub-5-second update SLO.
- Cold start has two flavors — users and items — and needs explicit handling for each. Skipping it causes catalog stagnation.
- Online learning cadence matters. 15-minute updates capture signal that nightly retrain misses.
- Multi-objective ranking is the norm. Watch time, diversity, fairness, and business rules all combine.
- LLMs fit as semantic encoders and top-of-funnel rerankers, not as primary retrievers or base rankers.
- A/B testing is the gold standard. Offline metrics lie; online metrics are the source of truth. Drift detection catches the rest.
Chapter 120 is the meta-question that ties together everything: designing a multi-tenant model serving platform that hosts all of these kinds of workloads.
Read it yourself
- Covington, Adams, and Sargin, Deep Neural Networks for YouTube Recommendations (2016). The foundational two-tower paper.
- Eksombatchai et al., Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time (Pinterest). For the engineering side.
- The Feast project documentation and the Tecton blog posts on feature stores.
- Guo et al., DeepFM: A Factorization-Machine based Neural Network for CTR Prediction (2017) — for ranking-model architectures beyond GBDTs.
- Erik Bernhardsson’s talks on recommendation systems at Spotify and Better — for the engineering reality.
- “Delayed Feedback Model for Continuous Training” (Criteo) — for online learning with delayed signals.
Practice
- Estimate the ranker compute for 100k impressions/sec, 500 candidates each, GBDT at 0.1 ms per score. How many cores?
- Design the real-time feature update path for “user’s last watched video” with a 5-second freshness SLO. Include Kafka, Flink, Redis, and failure modes.
- A new user signs up. Walk through the first 10 impressions they see, naming which candidate source dominates at each step.
- The item tower model changes. Design the incremental re-embedding strategy to avoid a cold-cache cliff.
- An A/B test shows the new model wins on watch time but loses on retention. What do you do?
- Add an LLM-based reranker for the top 20 candidates. Walk through the latency budget and cost impact.
- Stretch: Extend the design to a multi-surface platform (feed + search + notifications + ads) sharing the same feature store and retrieval fleet. What changes, what breaks, and what new services emerge?