Design a RAG system over 10TB of documents
"The hardest number in a RAG interview isn't 10TB. It's the p99 latency, because retrieval is where the tail lives."
The second worked design problem. “10 TB of documents” is the retrieval-heavy counterpart to Chapter 116’s chatbot. It forces the candidate away from pure LLM-serving concerns and into the full stack of Part IV — ingestion, chunking, embedding, indexing, sharding, hybrid search, reranking, context construction, evaluation, and freshness. The candidate who answers well cross-references Chapters 55–63 constantly, and volunteers the tradeoffs that only come from having built one of these.
This chapter, like Chapter 116, is a transcript-style walk through the interview framework. The illustrative choices are one reasonable design; every significant decision has a named alternative.
Outline:
- Clarify: the 10TB is ambiguous on purpose.
- Estimate: from document count to index size to node count.
- High-level architecture — the full RAG pipeline.
- Drill 1: the retrieval layer.
- Drill 2: the ingestion and reindex pipeline.
- Hybrid search and reranker placement.
- Context packing and the LLM call.
- Evaluation, freshness, and staleness.
- Operations, failure modes, and cost.
- Tradeoffs volunteered.
- The mental model.
117.1 Clarify: the 10TB is ambiguous on purpose
The candidate opens with the clarify phase. “10 TB” is deliberately vague — the number means different things depending on what it measures. The six questions:
1. Is 10 TB raw text, raw documents (PDFs, images, etc.), or ingested-and-indexed? The interviewer says 10 TB of raw documents: mostly PDF, some HTML, some Office docs. After text extraction, expect ~15% of that to be extractable text, so ~1.5 TB of actual text. At ~20 bytes per token (extraction artifacts inflate the byte count over plain prose), that's ~75 billion tokens. Chunked at 500 tokens per chunk with 50-token overlap, that’s ~150 million chunks.
2. What are the queries? Semantic natural language, keyword, structured filters, or all three? Natural language with optional filters (date range, document type, author). Assume ~5 QPS average, ~30 QPS peak. Low volume compared to the chatbot — retrieval workloads rarely exceed a few hundred QPS unless they’re consumer-facing.
3. What’s the latency target? End-to-end p95 < 2 seconds for a user-facing RAG query. Retrieval alone should fit in 400 ms p95; reranking in another 200 ms; LLM generation in the remaining ~1.4 s.
4. What’s the freshness requirement? New documents must be searchable within 15 minutes. Deletions (GDPR, takedowns) within 5 minutes. The freshness number rules out most monolithic reindexing strategies.
5. Do we need multi-tenancy? Yes — documents belong to different customers with access-control metadata. A query from customer A must never return documents from customer B. This pushes ACL filtering into every retrieval hop.
6. What’s the quality target? Recall@20 > 90% on a golden eval set. LLM-as-judge faithfulness score > 0.85. Grounding rate (every claim cited) > 95%.
The candidate writes each answer down. The numbers matter for the derivation chain that follows.
117.2 Estimate: from document count to index size to node count
Chain 3 from Chapter 115, walked out loud:
- Raw documents: 10 TB.
- Extractable text: 1.5 TB (~15% of raw for mixed-format corpora).
- Tokens: 1.5 TB / (~20 bytes/token average for mixed PDF/HTML with markup overhead) ≈ ~75 billion tokens. (Assume UTF-8 with mostly Latin-script content; raw-text bytes per token average higher than plain prose due to extraction artifacts.)
- Chunks at 500 tokens each: 75B / 500 = 150 million chunks.
- Embedding dimension: 1024 (a balanced modern model like BGE-large or nomic-embed-text-v1.5).
- Raw vector bytes: 150M × 1024 × 4 bytes (fp32) = ~614 GB raw vectors.
- With product quantization (OPQ or IVF-PQ, ~8× compression): ~77 GB compressed index.
- With HNSW graph overhead (~1.5× the raw vector size): 614 × 1.5 = ~920 GB for an in-memory HNSW index.
The candidate announces the fork: “I have two choices. Pure HNSW in RAM needs about 1 TB of memory across the cluster; a PQ-compressed index fits in under 100 GB but costs recall. I’d start with HNSW across ~5 nodes at 256 GB RAM each — that’s the ‘fits comfortably’ shape.”
- Disk footprint for raw chunks + metadata: ~2 TB on S3 (text, offsets, doc IDs, ACL tags).
- BM25 sparse index: OpenSearch/Elasticsearch-class, ~20% of the chunk size, ~300 GB across 3–5 shards.
- Embedding compute: 75B tokens × embedding model throughput. A BGE-large on 1× A10 does ~10k tokens/sec, so 75B / 10k ≈ 7.5M seconds = 87 days on one GPU. Parallelize to 30 embedding GPUs → 3 days to rebuild from scratch.
- Storage cost: 2 TB × $0.023/GB/month = $47/month on S3, negligible.
- Retrieval nodes: 5× vector nodes + 3× BM25 nodes + 2× reranker nodes + 2× router nodes ≈ 12 machines for the retrieval layer.
The candidate pivots: “The expensive part isn’t storage; it’s the embedding GPUs during reindex. I need to decide whether to build the reindex as a background batch job costing ~90 A10 GPU-days (or ~30 H100 GPU-days) per full rebuild, or as a streaming pipeline that embeds on document arrival. Given the 15-minute freshness requirement, it’s the streaming model.”
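The derivation chain can be sanity-checked in a few lines of Python; every constant below is one of the chapter's stated assumptions, not a measured value:

```python
# Back-of-envelope sizing for the 10 TB corpus, using the chapter's assumptions.
raw_bytes = 10e12
text_bytes = raw_bytes * 0.15            # ~15% of raw is extractable text
tokens = text_bytes / 20                 # ~20 bytes/token with extraction artifacts
chunks = tokens / 500                    # 500-token chunks (overlap ignored)
dim, fp32 = 1024, 4
raw_vec_gb = chunks * dim * fp32 / 1e9   # fp32 vectors
hnsw_gb = raw_vec_gb * 1.5               # ~1.5x HNSW graph overhead
pq_gb = raw_vec_gb / 8                   # ~8x product quantization
a10_days = tokens / 10_000 / 86_400      # one A10 at ~10k tokens/sec

print(f"{tokens / 1e9:.0f}B tokens, {chunks / 1e6:.0f}M chunks")
print(f"vectors: {raw_vec_gb:.0f} GB raw, {hnsw_gb:.0f} GB HNSW, {pq_gb:.0f} GB PQ")
print(f"full re-embed: ~{a10_days:.0f} A10 GPU-days")
```

Doing this arithmetic out loud, rather than quoting memorized numbers, is the behavior the interviewer is scoring.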
117.3 High-level architecture — the full RAG pipeline
The diagram:
Ingestion side:
[ document source (S3 bucket, SFTP drop, API) ]
|
v
[ ingestion queue (Kafka partitioned by tenant) ]
|
v
[ text extractor (Tika / Unstructured / custom) ]
|
v
[ chunker (recursive, 500-token windows, semantic overrides) ]
|
v
[ embedding service (TEI fleet, BGE-large on A10s) ]
|
v
[ upsert pipeline ]
| |
v v
[ vector store ] [ BM25 index ]
(HNSW, sharded (OpenSearch)
by tenant hash)
| |
+--------+----------+
|
[ metadata store (Postgres) ]
(doc IDs, ACLs, versions)
Query side:
[ client ]
|
v
[ API gateway + auth ]
|
v
[ query rewriter (HyDE + decomposition, optional) ]
|
v
[ ACL filter + tenant scoping ]
|
v
[ hybrid retrieval ]
| |
v v
[ dense retrieval ] [ BM25 retrieval ]
(HNSW, top-200) (top-200)
| |
+------------+------------+
|
v
[ fusion (RRF) -> top-100 candidates ]
|
v
[ cross-encoder reranker (bge-reranker-large on GPU) ]
|
v
[ context packer (top-20, dedup, diversity) ]
|
v
[ LLM serving fleet (same as Chapter 116) ]
|
v
[ citation extraction + grounding check ]
|
v
[ response ]
graph LR
subgraph Ingest["Ingestion Pipeline"]
SRC["Document source\nS3 / SFTP / API"] --> KAFKA["Kafka\npartitioned by tenant"]
KAFKA --> EXT["Text Extractor\nTika / Unstructured"]
EXT --> CHUNK["Chunker\n500-token windows"]
CHUNK --> EMB["Embedding Service\nTEI + BGE-large on A10"]
EMB --> VEC["Vector Store\nHNSW / Qdrant"]
EMB --> BM25["BM25 Index\nOpenSearch"]
VEC --> META["Metadata Store\nPostgres: doc_id, ACL"]
BM25 --> META
end
subgraph Query["Query Pipeline"]
CLIENT["Client"] --> GW2["Gateway + Auth"]
GW2 --> RW["Query Rewriter\nHyDE (optional)"]
RW --> ACL["ACL Filter\ntenant scope"]
ACL --> DENSE["Dense HNSW\ntop-200"]
ACL --> SPARSE["BM25\ntop-200"]
DENSE --> RRF["RRF Fusion\ntop-100"]
SPARSE --> RRF
RRF --> RANK["Cross-encoder Reranker\nbge-reranker-large · GPU · 80 ms"]
RANK --> PACK["Context Packer\ntop-20 · dedup · MMR"]
PACK --> LLM["LLM Fleet\n(Chapter 116)"]
LLM --> CITE["Citation + Grounding\ncheck"]
end
style RANK fill:var(--fig-accent-soft),stroke:var(--fig-accent)
The reranker is where 80% of precision comes from — retrieval narrows 150M chunks to 100, but the reranker separates the relevant 20 from the near-miss 80.
Boxes named with technologies:
- Ingestion queue: Kafka with Strimzi operator (Chapter 88), partitioned by tenant to preserve ordering per-tenant.
- Text extractor: Apache Tika or Unstructured.io (Chapter 61) — extractors are a messy layer; the candidate explicitly flags “extraction quality is the hidden failure mode of every RAG system.”
- Chunker: recursive character-level splitter with semantic overrides (headings, bullet points); 500-token window + 50-token overlap.
- Embedding service: TEI (Chapter 49) with BGE-large; alternative is nomic-embed-text or a domain fine-tune.
- Vector store: a managed ANN service (Pinecone/Weaviate/Qdrant/Vespa) or self-hosted HNSW on FAISS. The candidate picks Qdrant for the filtering support, but names the tradeoffs.
- BM25 index: OpenSearch with per-tenant aliases.
- Reranker: bge-reranker-large on a TEI fleet, or Cohere rerank as a managed service.
- LLM fleet: same as Chapter 116. The RAG system shares the inference fleet with the chatbot.
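A minimal sketch of the chunker's fixed-window core. The semantic overrides for headings and bullets are omitted; `chunk_tokens` and its inputs are illustrative, not a real library API:

```python
def chunk_tokens(tokens, window=500, overlap=50):
    """Fixed-window chunking with overlap. tokens: a list of token IDs."""
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window already covers the tail
    return chunks

# A 1000-token document yields 3 overlapping chunks at 500/50.
print(len(chunk_tokens(list(range(1000)))))
```

The 50-token overlap means adjacent chunks share a boundary region, so a sentence split across a window edge still appears whole in one chunk.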
The interviewer says “drill into retrieval.”
117.4 Drill 1: the retrieval layer
The job. Given a query and a tenant, return the top-20 most relevant chunks, ranked by the reranker's score, within 600 ms p95.
Why HNSW, specifically. HNSW (Chapter 59) is the graph-based ANN index that dominates production because it is fast, handles per-vector filtering reasonably, and degrades gracefully. For 150M vectors at 1024 dim, the in-memory footprint is ~600 GB raw, ~900 GB with graph overhead. This fits comfortably on 4–6 nodes at 256 GB RAM. The alternative is IVF-PQ, which compresses to ~75 GB but loses ~5–10% recall, which is unacceptable given the 90% recall@20 target.
A 10× scale-up to 100 TB of raw documents would push the index past the RAM-only regime. At that point, DiskANN (Vamana) becomes attractive — it keeps the graph on SSD and still delivers <20 ms p95 latency for billion-scale indexes. The candidate names this as the scale-out plan: “at 100 TB, I’d benchmark DiskANN seriously. At 10 TB, HNSW in RAM is the right call.”
Sharding strategy. Shard by tenant hash. This keeps each tenant’s data co-located on a single shard, which is the simplest tenancy model and avoids the fan-out problem. The downside: a single large tenant can skew shards. Mitigation: split large tenants across multiple shards with a second-level hash, and route queries with explicit fan-out for those tenants.
ACL filtering. Every chunk has metadata tags (tenant ID, access group, document type, date). At query time, filter by tenant_id = X AND access_group IN (user_groups). Qdrant and Weaviate support per-vector filter tags natively; FAISS does not, so FAISS-based setups have to post-filter (retrieve top-K larger, then filter), which is slower and recall-unsafe for selective filters.
The post-filtering recall problem. If you retrieve top-200 and only 5 match the filter, your effective recall is terrible. The rule of thumb: if filter selectivity is <1%, pre-filter (apply the filter before the ANN search). Most vector stores support this via an inverted index on the filter keys. Qdrant does it well; this is why the candidate picked it.
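The selectivity arithmetic behind that rule of thumb, as a sketch (function name is illustrative):

```python
# Rule of thumb: with post-filtering, expected usable results = K_retrieved x selectivity.
def expected_survivors(k_retrieved: int, selectivity: float) -> float:
    """Expected number of retrieved chunks that survive a filter applied after ANN."""
    return k_retrieved * selectivity

# Retrieving top-200 against a 0.5%-selective ACL filter leaves ~1 usable chunk
# when we need 20 -- recall collapses:
print(expected_survivors(200, 0.005))
# To expect 20 matches post-filter, you'd have to over-retrieve top-4000:
print(expected_survivors(4000, 0.005))
```

Over-retrieving 20× to compensate is exactly the latency and memory cost that pre-filtering avoids.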
Hybrid search with BM25. Dense retrieval misses keyword matches, especially for rare terms (product codes, error messages, proper nouns). BM25 complements it. Both are queried in parallel, each returns top-200, and the results are fused via RRF (Chapter 60):
rrf_score(doc) = sum over retrievers r: 1 / (k + rank_r(doc))
with k = 60 typically. The fusion happens in the query service, not in the vector store. Fusion typically adds 10% recall over dense alone and 5% over a tuned dense+sparse concatenation.
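The fusion step is small enough to write out. A minimal RRF implementation over ranked doc-ID lists (the IDs below are illustrative):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: combine ranked doc-ID lists using ranks only."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # dense retriever's top hits
sparse = ["d1", "d9", "d3"]  # BM25's top hits
print(rrf_fuse([dense, sparse]))  # → ['d1', 'd3', 'd9', 'd7']
```

Note that `d1` wins despite being ranked first by only one retriever: appearing high in both lists beats appearing first in one, which is the behavior that makes RRF robust to a single retriever's misses.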
Reranker placement. The reranker is a cross-encoder (bge-reranker-large, ~300M parameters) that scores query-document pairs directly. It is ~100× slower than embedding-based scoring but much more accurate. The pattern: retrieve top-100 from hybrid search, rerank to top-20, send top-20 as context. Reranking 100 candidates at ~5 ms each (on a GPU) = 500 ms — too slow for the 2-second budget if done sequentially. Instead, batch 100 candidates in one GPU call: ~80 ms on an A10. Acceptable.
An alternative placement: rerank at ingestion time against expected queries. Not feasible here because queries are free-form natural language. Rerank-at-query is the right pattern.
Latency budget for retrieval.
| Stage | p50 | p95 |
|---|---|---|
| Query rewrite (optional) | 100 ms | 200 ms |
| Embedding the query | 5 ms | 15 ms |
| Dense HNSW top-200 | 15 ms | 40 ms |
| BM25 top-200 | 10 ms | 30 ms |
| RRF fusion | 2 ms | 5 ms |
| Reranker top-100 → top-20 | 60 ms | 120 ms |
| Context pack + ACL check | 5 ms | 15 ms |
| Total retrieval | ~200 ms | ~425 ms |
With LLM generation on top (the ~1.4 s budget from the clarify phase), the full response lands around 1.8 s p95, under the 2-second target.
117.5 Drill 2: the ingestion and reindex pipeline
Ingestion is where most RAG systems rot. The candidate drills it because it’s the differentiator.
Flow. A new document lands in S3 (via SFTP drop, API upload, or periodic crawl). An S3 event triggers a Kafka message to the ingestion topic, partitioned by tenant. Workers consume from the topic in parallel; each worker:
- Downloads the document from S3.
- Runs extraction (Tika for PDF, BS4 for HTML, Pandoc for Office).
- Chunks the extracted text using the recursive chunker with semantic override for headings.
- Publishes chunks to an `embeddings` topic.
- A second fleet consumes from `embeddings`, batches chunks, calls the embedding service, and upserts vectors into the vector store and text into the BM25 index.
- A metadata row is written to Postgres with `(doc_id, tenant_id, version, ingest_ts, chunk_count, status)`.
End-to-end latency from document arrival to searchability: ~5–10 minutes for a typical PDF, dominated by extraction and embedding. The 15-minute SLO is met with margin.
Updates and deletes. Documents change. A versioning scheme: each doc has a version number; new versions write new chunks with the new version, and old chunks are marked superseded_at in the metadata store. A background sweeper removes superseded chunks from the vector and BM25 indexes after a grace period. Deletes are harder — they must be immediate for GDPR. The pattern: a delete message in Kafka triggers immediate removal from vector store, BM25 index, and metadata. The sweeper handles tombstones.
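A sketch of the supersede-on-upsert bookkeeping, with an in-memory dict standing in for the Postgres metadata table (all names are illustrative):

```python
import time

def upsert_doc(meta, doc_id, version, chunk_ids):
    """Write a new document version; mark older live versions superseded.
    meta maps (doc_id, version) -> row, standing in for the metadata table."""
    for (d, v), row in meta.items():
        if d == doc_id and v < version and row["superseded_at"] is None:
            row["superseded_at"] = time.time()  # sweeper removes after grace period
    meta[(doc_id, version)] = {"chunks": chunk_ids, "superseded_at": None}

meta = {}
upsert_doc(meta, "doc1", 1, ["c1", "c2"])
upsert_doc(meta, "doc1", 2, ["c3", "c4"])
print(meta[("doc1", 1)]["superseded_at"] is not None)  # → True
```

The grace period matters: in-flight queries may still hold chunk IDs from version 1, so immediate deletion is reserved for the GDPR path.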
Reindex. When the embedding model changes (e.g., upgrade from BGE-large-v1 to v2), the entire corpus must be re-embedded. This is expensive — 30 GPU-days of embedding compute. The strategy is dual-index: stand up the new index in parallel, re-embed the corpus into it, run both indexes in shadow mode, and cut over when the new one meets the quality bar. Cut-over is gradual via traffic split. This avoids a cutover cliff and lets eval catch regressions before production sees them.
Embedding batching. A TEI embedder is most efficient with batch size 64 and sequence length 512. A chunk at 500 tokens matches. The embedding fleet runs at ~10k tokens/sec per A10, ~30k per H100. For 75B tokens at ~30k/sec/GPU, a 30-GPU fleet rebuilds in ~30 hours of wall time.
Cost of the embedding layer. 30 GPUs × $1/hour × 30 hours = $900 per full rebuild. Incremental ingestion is much cheaper — at 100k new documents/day (~5M chunks, ~2.5B tokens), the daily cost is ~$80. Negligible.
117.6 Hybrid search and reranker placement
Some interview-friendly deep points:
Why dense+sparse beats dense alone. Dense embeddings miss rare tokens because they compress them into the vector space; they also miss exact phrase matches because they’re trained to generalize. BM25 preserves both. Empirically (MTEB and BEIR benchmarks), dense+BM25 with RRF consistently adds 5–10% recall over either alone on heterogeneous corpora. The gain is largest on corpora with lots of proper nouns, codes, or technical jargon.
Why RRF specifically. RRF doesn’t require normalizing scores across retrievers. Dense retrievers return cosine scores in [−1, 1], BM25 returns unbounded term-weight sums. Averaging them directly is bad. RRF uses only the rank, which is scale-invariant. k = 60 is a standard choice from the original RRF paper (Cormack et al. 2009).
Why reranker placement matters. The reranker is the highest-signal component in the pipeline. A mediocre retriever + strong reranker beats a strong retriever + no reranker. The classic failure mode: candidates over-invest in retriever quality and skip the reranker entirely. The reranker is where 80% of precision gains come from.
Reranker as bottleneck. One reranker GPU at ~80 ms per batched call sustains 1/0.08 = 12.5 QPS, so the 30 QPS peak needs 30 / 12.5 ≈ 2.4 GPUs-worth of capacity. Two always-on GPUs cover steady state, with autoscaling absorbing bursts. If QPS grows 10×, the reranker is the first bottleneck to scale.
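The capacity arithmetic, written out (function name is illustrative):

```python
def reranker_gpus_needed(peak_qps: float, batch_latency_s: float = 0.08) -> float:
    """One GPU reranks one query's candidate batch per batch_latency_s."""
    qps_per_gpu = 1.0 / batch_latency_s
    return peak_qps / qps_per_gpu

print(reranker_gpus_needed(30))  # → 2.4
```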
HyDE and query rewriting. HyDE (Hypothetical Document Embeddings) generates a hypothetical answer via LLM and embeds that instead of the query. It helps when the query is short and ambiguous. Costs an extra LLM call (~500 ms), so it’s optional and gated by query length. Other query rewriting techniques (multi-query, query decomposition) are similar — useful for specific query classes, not universal.
117.7 Context packing and the LLM call
After the top-20 chunks are selected, they must be packed into the LLM’s context window. The constraints:
- Token budget: typical LLMs have 32k–128k context. Assume 16k token context for the RAG call, leaving room for system prompt + query + generation.
- Dedup: chunks from the same document may be redundant. Group by `doc_id` and keep the highest-scoring chunk per document, up to 10 documents.
- Diversity: prefer coverage over concentration. MMR (Maximal Marginal Relevance) with λ=0.5 balances relevance and diversity.
- Ordering: place the highest-scoring chunks at the start (for relevance) and at the end (for recency bias in LLMs). Middle-of-context information is less likely to be attended to.
- Citation metadata: each chunk is wrapped with `[doc_id=X, page=Y]` so the LLM can cite its source.
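A sketch of the dedup-and-wrap step (MMR diversity omitted; the field names and `pack_context` helper are illustrative):

```python
def pack_context(chunks, max_docs=10):
    """Keep only the highest-scoring chunk per document, up to max_docs documents,
    and wrap each surviving chunk with its citation tag."""
    packed, seen = [], set()
    for c in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if c["doc_id"] in seen or len(seen) >= max_docs:
            continue  # redundant chunk from an already-represented document
        seen.add(c["doc_id"])
        packed.append(f"[doc_id={c['doc_id']}, page={c['page']}] {c['text']}")
    return packed

chunks = [
    {"doc_id": 123, "page": 4, "score": 0.91, "text": "alpha"},
    {"doc_id": 123, "page": 5, "score": 0.88, "text": "beta"},   # same doc, dropped
    {"doc_id": 456, "page": 1, "score": 0.80, "text": "gamma"},
]
print(pack_context(chunks))
```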
The prompt structure:
You are a helpful assistant. Answer the question using only the provided context.
Cite each claim with its [doc_id, page] tag.
Context:
[doc_id=123, page=4] ...chunk text...
[doc_id=123, page=5] ...chunk text...
[doc_id=456, page=1] ...chunk text...
...
Question: {user_query}
Answer:
Grounding check. After the LLM responds, a post-processor extracts the citation tags and verifies that every factual claim has a citation. Uncited claims are flagged. A second LLM call (a cheap one, GPT-4o-mini class) scores the response for faithfulness against the cited chunks. Low scores trigger a re-try with a different prompt or different chunks.
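A naive version of the citation-extraction half of that check, as a sketch: it flags sentences carrying no `[doc_id, page]` tag. Real claim segmentation is harder than splitting on periods, which is why the faithfulness score still needs the judge LLM:

```python
import re

CITE = re.compile(r"\[doc_id=\d+, page=\d+\]")

def uncited_sentences(answer: str) -> list[str]:
    """Return sentences of the LLM answer that contain no citation tag."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return [s for s in sentences if not CITE.search(s)]

ans = "Revenue grew 12% [doc_id=123, page=4]. The CEO resigned in May"
print(uncited_sentences(ans))  # → ['The CEO resigned in May']
```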
117.8 Evaluation, freshness, and staleness
Evaluation harness. Three layers (Chapter 64):
- Retrieval eval: a golden set of (query, relevant-doc-IDs) pairs. Measured: recall@20, MRR, nDCG@10. Runs nightly and on every index change.
- End-to-end eval: golden set of (query, ideal-answer) pairs scored by LLM-as-judge on faithfulness, relevance, and citation accuracy. Runs on every reranker or LLM change.
- Online eval: user feedback (thumbs), click-through on citations, and implicit signals (did the user ask a follow-up, suggesting the first answer was incomplete).
The golden set starts at 500 examples and grows to thousands over time. Every production incident adds an example. Every time a new document type comes in, a handful of examples are added.
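The retrieval layer of the eval reduces to a few lines; a sketch over made-up doc IDs:

```python
def recall_at_k(retrieved, relevant, k=20):
    """Fraction of the relevant doc IDs present in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# One golden example: 2 of the 3 relevant docs appear in the top-k.
print(recall_at_k(["d1", "d2", "d9"], ["d1", "d9", "d4"]))  # → 0.666...
```

The nightly job averages this over the whole golden set and alerts if the average drops below the 90% target.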
Freshness. Ingestion lag is tracked as the p95 of now() − ingest_ts; alert if it exceeds 20 minutes. The freshness SLI is the fraction of documents searchable within 15 minutes of arrival, with an SLO of 95%.
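The freshness SLI as code (the lag samples are illustrative):

```python
def freshness_sli(lag_seconds, slo_seconds=15 * 60):
    """Fraction of recently ingested documents searchable within the SLO window."""
    return sum(lag <= slo_seconds for lag in lag_seconds) / len(lag_seconds)

# Seconds from document arrival to searchability; 1200 s misses the 15-min SLO.
print(freshness_sli([120, 300, 600, 1200, 80]))  # → 0.8
```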
Staleness. A different problem: a document still present in the index has been updated or deleted at the source. The sweeper process checks for stale entries every 5 minutes and removes them.
Reindex schedule. Full reindex is triggered only on embedding model upgrades. Expected once every ~6 months. Incremental is continuous.
117.9 Operations, failure modes, and cost
Cost breakdown per month (rough, 10 TB corpus, 30 QPS peak):
| Component | Monthly cost |
|---|---|
| Vector store (5× 256 GB RAM nodes, compute + RAM) | ~$5,000 |
| BM25 / OpenSearch (3 nodes, medium) | ~$1,500 |
| Embedding fleet (5× A10, always-on) | ~$1,000 |
| Reranker fleet (2× A10, always-on) | ~$500 |
| Ingestion workers (K8s autoscaling) | ~$1,000 |
| Kafka (3× m5.xlarge brokers) | ~$500 |
| S3 storage (2 TB) | ~$50 |
| Postgres metadata (db.m5.large) | ~$200 |
| LLM serving (shared with chatbot) | — |
| Total (retrieval layer only) | ~$9,750/month |
The retrieval layer is cheap compared to LLM serving. Most of the cost in a RAG product is the LLM call, not the retrieval.
Top failure modes:
- Extraction quality drop. A new document format (encrypted PDFs, image-heavy docs) is extracted as empty strings. Effect: those documents are “in the index” but unsearchable. Mitigation: track extraction yield per document; alert on <80% extraction rate.
- Embedding model drift. A new embedding model version changes the vector space, making old and new vectors non-comparable. Mitigation: version the embedding model in the vector metadata; refuse to mix versions during queries.
- Reranker timeout. Under load, the reranker p99 latency blows up and RAG requests time out. Mitigation: aggressive reranker autoscaling (KEDA on queue depth); shed load before the reranker if the queue is too deep.
- ACL leak. A bug in the filter layer returns cross-tenant documents. Catastrophic. Mitigation: integration tests that deliberately attempt cross-tenant reads and assert they fail.
- Index corruption. HNSW graphs can become inconsistent under heavy concurrent writes. Mitigation: write through a single-writer queue per shard; snapshot to S3 daily for rollback.
Observability: retrieval-specific metrics include retrieval_recall_at_k_online, retrieval_latency_seconds, rerank_batch_size, ingestion_lag_seconds, extraction_yield_ratio, index_write_qps, and cross_tenant_filter_violations_total (should always be 0; alert on any non-zero).
117.10 Tradeoffs volunteered
- HNSW vs IVF-PQ: HNSW for recall; IVF-PQ if scale pushes past 100 TB.
- Managed vector store vs self-hosted: managed (Qdrant Cloud, Pinecone) for operational simplicity; self-hosted FAISS if tight cost control is needed.
- BGE-large vs nomic-embed-text vs domain fine-tune: general model for initial build; domain fine-tune if recall plateau hits below target.
- Pre-filter vs post-filter on ACLs: pre-filter mandatory for correctness; post-filter only for high-cardinality filters.
- Full reindex vs incremental: incremental for steady state; full reindex only on embedding model upgrade.
- Cross-encoder reranker vs ColBERT-style late interaction: cross-encoder for precision; ColBERT if latency budget is tight and recall trades.
- Kafka vs SQS vs direct DB-based queue: Kafka for throughput and ordering; SQS is simpler if volume is <10k msg/sec.
- Same LLM fleet as chatbot vs dedicated: shared for cost; dedicated if RAG’s long-context patterns starve the chatbot’s low-latency patterns.
117.11 The mental model
Eight points to take into Chapter 118:
- “10 TB” is ambiguous. Ask if it’s raw, extracted, or indexed. The factor of 10 between raw bytes and useful tokens changes the sizing completely.
- HNSW in RAM for ~1B vectors, DiskANN for larger. The crossover is around 1–2 TB of vector data.
- Hybrid search (dense + BM25 with RRF) is the default. It’s non-negotiable on heterogeneous corpora.
- The reranker is where 80% of precision comes from. Always include it; never skip it to save latency.
- Ingestion is where RAG systems rot. Extraction yield, dedup, versioning, ACL tagging — all of it must be observable.
- ACL filtering must be pre-filter, not post-filter. Post-filter destroys recall on selective filters.
- Freshness and staleness are different problems. Freshness = ingest lag; staleness = updated-but-not-invalidated entries.
- Evaluation is three layers: retrieval, end-to-end, and online. All three are mandatory for a trustworthy RAG system.
Chapter 118 moves to the safety-critical variant: designing a content moderation pipeline, where false positives and false negatives have very different costs.
Read it yourself
- Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — the foundational RAG paper.
- Cormack, Clarke, and Büttcher, Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods — the RRF paper.
- The MTEB benchmark paper and the MTEB leaderboard on Hugging Face — for picking embedding models.
- The Pinecone and Qdrant documentation on filtering and hybrid search.
- The Ragas paper and GitHub repository for evaluation tooling.
- Khattab et al., ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT — for the alternative to cross-encoder reranking.
- The DiskANN paper (Subramanya et al.) for billion-scale ANN on SSD.
Practice
- Re-estimate the index size for 100 TB of raw documents (10× the problem). At what point does HNSW in RAM become infeasible?
- Design the sharding scheme for a tenant-skewed workload: the largest tenant is 30% of the corpus, the smallest is 0.01%.
- The freshness SLO is tightened to “searchable within 60 seconds.” What changes in the ingestion pipeline? Which components become bottlenecks?
- A new document class arrives that extraction fails on (encrypted PDFs). Design the monitoring and fallback path.
- The embedding model changes from BGE-large to a newer version. Walk through the dual-index rollout plan end to end, including eval gates and rollback.
- A query returns 0 results due to an over-aggressive filter. Design the debugging loop: logs, traces, eval replay, operator tooling.
- Stretch: The interviewer says “now we’re adding image documents — the corpus is 10 TB of mixed text and images.” Extend the design with multimodal embeddings. What changes in retrieval, reranking, and context packing? Which existing components stay, and which are replaced?