Part IV · Information Retrieval & RAG
Chapter 65

Designing a RAG system end to end

"Most RAG systems are 80% chunking, 15% retrieval, and 5% generation. Most teams spend 80% of their time on the 5%."

This is the closing chapter of Part IV — Information Retrieval & RAG. We’ve covered every component of a RAG system in Chapters 57-64: lexical retrieval, dense retrieval, vector indices, hybrid search, chunking, reranking, query rewriting, and evaluation. This chapter puts them all together into a coherent end-to-end design and a step-by-step process for building production RAG.

By the end you’ll have a complete reference architecture and the operational playbook for building RAG systems that actually work.

Outline:

  1. The reference architecture.
  2. The end-to-end pipeline.
  3. Design decisions in order.
  4. Operational considerations.
  5. Common failure modes.
  6. The build sequence.
  7. Scaling considerations.
  8. The Part IV capstone.

65.1 The reference architecture

A complete production RAG system has the following components:

[User query]
     |
     v
+--------------------+
|  Conversational    |  ←  conversation history
|  query rewriting   |     (LLM call to standalone-ify)
+--------------------+
     |
     v
+--------------------+
|  Query expansion   |  ←  optional: HyDE, multi-query
|  / HyDE            |     (LLM call to expand)
+--------------------+
     |
     v
+--------------------+
|  Hybrid retrieval  |
|  ┌──────┬───────┐  |
|  │ BM25 │ Dense │  |  ←  TEI for embeddings
|  └──────┴───────┘  |     Elasticsearch for BM25
|        |           |     Qdrant for vectors
|     [RRF fusion]   |
+--------------------+
     |
     | top 100 candidates
     v
+--------------------+
|  Cross-encoder     |
|  reranker          |  ←  TEI with bge-reranker-large
+--------------------+
     |
     | top 10 reranked
     v
+--------------------+
|  Context assembly  |
|  - format          |
|  - dedupe          |
|  - cite sources    |
+--------------------+
     |
     v
+--------------------+
|  LLM generation    |  ←  vLLM
|  with context      |
+--------------------+
     |
     v
[Response with citations]


[Indexing pipeline (offline)]
+--------------------+
| Documents          |
| (PDFs, HTML, etc.) |
+--------------------+
     |
     v
+--------------------+
| Parse / extract    |  ←  Unstructured, pdfplumber
| text               |
+--------------------+
     |
     v
+--------------------+
| Chunk              |  ←  recursive char splitter
| (with overlap)     |     parent-document setup
+--------------------+
     |
     v
+--------------------+
| Embed              |  ←  TEI with bge-large
+--------------------+
     |
     +--> [Vector DB (Qdrant)]
     |
     +--> [BM25 index (Elasticsearch)]
     |
     +--> [Document store (S3 / Postgres)]

Two pipelines: the online query pipeline (top, runs per request) and the offline indexing pipeline (bottom, runs when documents change).

This is the reference. Production systems vary in details, but the core pieces are always present.

The reference architecture separates offline indexing (builds two indexes from raw documents) from online querying (rewrite → parallel retrieve → RRF → rerank → generate) — each stage has its own latency budget.

65.2 The end-to-end pipeline

Walking through a query:

Step 1: Query reception. The user’s query arrives at the API gateway with conversation history (if multi-turn).

Step 2: Conversational rewriting. A small LLM rewrites the query to be standalone, incorporating context from the conversation.

Step 3: Optional HyDE / expansion. If the system uses HyDE, the LLM generates a hypothetical answer to use as the retrieval anchor.

Step 4: Parallel hybrid retrieval.

  • BM25 retrieves the top 100 candidates from the lexical index.
  • Dense retrieval embeds the (rewritten / hypothetical) query and retrieves the top 100 from the vector index.

Step 5: RRF fusion. The two lists are merged via Reciprocal Rank Fusion into a single ranked list of ~150-200 unique candidates.
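
Step 5 is small enough to show in full. A minimal sketch of RRF (the `rrf_fuse` helper and the sample ids are illustrative; k = 60 is the conventional constant):

```python
# Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d).
# Documents that rank well in either list float to the top of the merge.

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc ids; returns ids sorted by RRF score."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]   # lexical top-k (illustrative)
dense_top = ["d1", "d9", "d3"]  # dense top-k (illustrative)
fused = rrf_fuse([bm25_top, dense_top])  # d1 first: ranked high in both
```

Note that RRF only needs ranks, never raw scores, which is exactly why it works across BM25 and cosine similarity without calibration.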

Step 6: Reranking. The cross-encoder reranks the top 100 candidates from the fused list, producing a final ranking.

Step 7: Context assembly. The top 5-10 reranked candidates are formatted into the prompt context, with citation markers and any necessary deduplication.
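
Step 7 can be sketched as a small helper (the function name and sample chunks are hypothetical; real assembly also enforces a token budget):

```python
# Context assembly: drop exact-duplicate chunks, keep the best-ranked
# copies, and format the survivors with numbered citation markers.

def assemble_context(chunks, max_chunks=10):
    """chunks: ranked list of {"doc_id": ..., "text": ...} dicts."""
    seen, selected = set(), []
    for chunk in chunks:
        if chunk["text"] not in seen:          # exact-duplicate filter
            seen.add(chunk["text"])
            selected.append(chunk)
        if len(selected) == max_chunks:
            break
    lines = [f'[{i}] ({c["doc_id"]}) {c["text"]}'
             for i, c in enumerate(selected, 1)]
    return "\n\n".join(lines)

ctx = assemble_context([
    {"doc_id": "faq.md", "text": "Refunds take 5 days."},
    {"doc_id": "faq_copy.md", "text": "Refunds take 5 days."},  # duplicate
    {"doc_id": "policy.pdf", "text": "Refunds need a receipt."},
])
```

The `[n]` markers are what the generation prompt asks the LLM to cite, which makes citations checkable later.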

Step 8: LLM generation. vLLM generates the answer, conditioned on the system prompt + context + user query.

Step 9: Streaming and citation rendering. The response streams to the user, with citations linked back to the source documents.

The total latency budget:

  • Conversational rewriting: ~50 ms
  • HyDE: ~100 ms (optional)
  • Retrieval: ~20 ms
  • Reranking: ~150 ms
  • LLM TTFT: ~300 ms
  • LLM decode: seconds

Total TTFT: ~600-800 ms. Total e2e: depends on response length, typically 5-30 seconds.
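
The control flow of steps 2-8 can be sketched with stubs (every function body here is a placeholder for the real TEI / Elasticsearch / Qdrant / vLLM call; only the orchestration, including the parallel retrieval of Step 4, is the point):

```python
import asyncio

async def rewrite(query, history):      # Step 2: conversational rewriting
    return query if not history else f"{query} (standalone)"

async def bm25_search(query, k=100):    # Step 4a: lexical retrieval (stub)
    return [f"lex{i}" for i in range(3)]

async def dense_search(query, k=100):   # Step 4b: dense retrieval (stub)
    return [f"vec{i}" for i in range(3)]

async def rerank(query, candidates):    # Step 6: cross-encoder (stub)
    return candidates[:10]

async def generate(query, context):     # Step 8: LLM generation (stub)
    return f"Answer to {query!r} using {len(context)} chunks."

async def answer(query, history=()):
    q = await rewrite(query, history)
    # Step 4: BM25 and dense retrieval run concurrently.
    lex, vec = await asyncio.gather(bm25_search(q), dense_search(q))
    fused = list(dict.fromkeys(lex + vec))   # placeholder for RRF fusion
    top = await rerank(q, fused)
    return await generate(q, top)

result = asyncio.run(answer("when do refunds post?"))
```

Running the two retrievers concurrently is what keeps the ~20 ms retrieval line in the budget above: you pay the slower of the two, not the sum.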

65.3 Design decisions in order

When building a RAG system from scratch, make these decisions in this order:

(1) What’s the corpus?

  • Document types (PDFs, HTML, code, structured data).
  • Total size (number of documents, total tokens).
  • Update frequency (static or dynamic?).
  • Languages (English-only or multilingual?).

This determines the parsing pipeline, the chunking strategy, the embedder choice, and the index size.

(2) What questions will users ask?

  • Length distribution (short keywords or long natural language?).
  • Type (factual, analytical, multi-hop, conversational?).
  • Domain (general or specialized?).
  • Volume (low or high QPS?).

This determines the query rewriting strategy, the retrieval depth, and the LLM choice.

(3) What’s the latency budget?

  • TTFT requirement.
  • Total response time tolerance.
  • Throughput requirement.

This determines whether you can afford reranking, query rewriting, multiple retrieval calls, etc.

(4) What’s the quality requirement?

  • High accuracy (legal, medical) or best-effort?
  • Faithfulness critical or just helpfulness?

This determines the LLM choice, the context size, the safety layer.

(5) Pick the components.

Based on the above, pick:

  • Parsing: Unstructured, pdfplumber, custom parsers.
  • Chunking: recursive character splitter (default), or parent-document for long contexts, or function-level for code.
  • Embedder: BGE-large-en (English) or BGE-M3 (multilingual) — modern defaults.
  • Vector DB: Qdrant (default), pgvector if you’re already on Postgres.
  • BM25 index: Elasticsearch or OpenSearch.
  • Reranker: bge-reranker-large.
  • LLM: depends on the budget. Llama 3 70B is a strong default for self-hosted; GPT-4o or Claude for API.
  • Query rewriting: conversational rewriting always; HyDE optional.
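
One way to make these picks explicit is a single configuration object, so a deployment is fully described in one place (a sketch; the field names and defaults are illustrative, not any library's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RagConfig:
    embedder: str = "BAAI/bge-large-en-v1.5"
    reranker: str = "BAAI/bge-reranker-large"
    vector_db: str = "qdrant"
    bm25_backend: str = "elasticsearch"
    chunker: str = "recursive"
    chunk_size: int = 512          # tokens; tune per corpus
    chunk_overlap: int = 64
    retrieve_k: int = 100          # candidates per retriever
    rerank_k: int = 10             # chunks passed to the LLM
    use_hyde: bool = False         # optional query expansion

cfg = RagConfig(use_hyde=True)     # override only what differs
```

Freezing the dataclass means a config can be logged alongside every eval run, which matters once you start A/B testing variants.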

(6) Build the indexing pipeline first.

  • Get documents in.
  • Parse and chunk.
  • Embed and store.
  • Verify the index has the right number of documents and the embeddings look right.
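
The verification step can be a handful of assertions (a sketch assuming 1024-dimensional embeddings, bge-large's output size; `check_index` is a hypothetical helper):

```python
import math

def check_index(expected_count, embeddings, dim=1024):
    """Sanity-check an index after a fresh build."""
    assert len(embeddings) == expected_count, "missing or duplicated chunks"
    for vec in embeddings:
        assert len(vec) == dim, "dimension mismatch (wrong embedder?)"
        # A zero vector usually means a failed or empty embed call.
        assert math.sqrt(sum(x * x for x in vec)) > 0, "zero vector"
    return True

ok = check_index(2, [[0.1] * 1024, [0.2] * 1024])
```

These three checks catch the most common indexing bugs: dropped documents, a silently swapped embedder, and empty-text chunks embedded to zeros.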

(7) Build the query pipeline.

  • Retrieval first.
  • Then reranking.
  • Then LLM generation.
  • Then query rewriting.

Test each stage before adding the next.

(8) Build the eval pipeline.

  • Curate a golden set of 50-200 questions with expected answers.
  • Run the eval; iterate on the configuration until quality is acceptable.
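
A golden-set harness can start very simply (a sketch: `run_eval` and the substring scorer are illustrative stand-ins; production evals usually use an LLM judge, as in Chapter 64):

```python
# Minimal golden-set harness: run each question through the system and
# score with a crude substring check against the expected answer.

def run_eval(golden_set, answer_fn):
    hits, failures = 0, []
    for item in golden_set:
        answer = answer_fn(item["question"])
        if item["expected"].lower() in answer.lower():
            hits += 1
        else:
            failures.append(item["question"])
    return {"accuracy": hits / len(golden_set), "failures": failures}

golden = [
    {"question": "How long do refunds take?", "expected": "5 days"},
    {"question": "Is a receipt required?", "expected": "receipt"},
]
# The lambda stands in for the full RAG pipeline.
report = run_eval(golden, lambda q: "Refunds take 5 days.")
```

Keeping the failures list, not just the score, is the point: those questions are what you debug in the next iteration.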

(9) Deploy with monitoring.

  • Production observability (latency, throughput, error rate).
  • Periodic eval on the golden set.
  • User feedback collection.

This is the ordered playbook. Most teams skip steps (especially eval) and regret it.

65.4 Operational considerations

The points that matter once you’re in production:

(1) Index versioning. When you change the embedder or chunker, you need to re-index. The new index lives alongside the old one until you cut over. Don’t try to update in place.

(2) Document update flow. When a source document changes (added, modified, deleted), the index needs to reflect it. Decide on a frequency (real-time, hourly, daily) and build the pipeline.

(3) Cold start for new users. A new user’s first query might not benefit from caching. Plan for the latency variance.

(4) Tenant isolation. If multiple tenants share the system, their corpora need to be separated. Use metadata filtering or per-tenant indices.

(5) Citation accuracy. When the LLM produces an answer, the citations should map back to actual retrieved documents. Track this; it’s a common source of bugs.

(6) Monitoring. Track retrieval metrics, generation metrics, end-to-end latency, and user feedback separately. Each tells you about a different failure mode.

(7) A/B testing. When you want to try a new configuration (different embedder, different reranker, different LLM), run it as a canary against a fraction of traffic and compare metrics.

(8) Incident response. When the LLM produces a bad answer, can you debug? Save the (query, retrieved context, answer) tuples to a logging system so you can reproduce and analyze later.
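
Point (8) in code: a minimal request logger (a sketch; the list `sink` stands in for a real log store such as S3 or Postgres, and real systems also record the config version):

```python
import json
import time

def log_request(sink, query, context_ids, answer):
    """Persist one (query, retrieved context, answer) tuple for replay."""
    record = {
        "ts": time.time(),
        "query": query,
        "context_ids": context_ids,  # ids, not full text, keep logs small
        "answer": answer,
    }
    sink.append(json.dumps(record))  # stand-in for a real log store
    return record

log = []
rec = log_request(log, "refund policy?", ["faq.md#3"], "Refunds take 5 days.")
```

With these tuples saved, "why did the system say X last Tuesday?" becomes a query over logs instead of an unreproducible mystery.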

These operational concerns aren’t glamorous, but they’re what separates a working RAG system from a demo.

65.5 Common failure modes

A non-exhaustive list of ways RAG goes wrong:

(1) Bad parsing. PDF extraction is hard. Tables, footnotes, multi-column layouts often get mangled. The garbage-in-garbage-out problem.

(2) Wrong chunk size. Chunks too small lose context; chunks too big are unfocused. Tune.

(3) Embedder mismatch. The embedder isn’t trained on your domain. Quality is poor.

(4) Missing hybrid retrieval. Pure dense (or pure BM25) misses obvious matches.

(5) No reranker. The bi-encoder’s ranking has a noisy top-K, and the LLM gets the wrong context.

(6) The model isn’t using the context. The LLM is producing answers from training data, not from the retrieved documents. The OOD test from Chapter 64 catches this.

(7) Outdated index. The corpus changed, but the index is stale. The LLM produces answers based on old information.

(8) Tenant leakage. A query from tenant A retrieves documents from tenant B. Critical security bug.

(9) Citation hallucination. The LLM cites a document that doesn’t actually contain the claim. Faithfulness failure.

(10) Context window overflow. The retrieved context is too large; the LLM truncates it silently.

(11) Conversational rewriting failure. A multi-turn query fails because the rewriting didn’t capture the right context.

(12) Latency creep. Each stage adds 100-200 ms; the total TTFT becomes unacceptable.

These are all real failure modes I’ve seen in production. Each has a fix. The key is to monitor for them and to test for them in eval.
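
One of these checks, for failure mode (9), fits in a few lines (a sketch: it only catches citations of chunks that were never in the context; verifying that a cited chunk actually supports the claim needs an entailment check):

```python
import re

def ungrounded_citations(answer, num_context_chunks):
    """Return cited [n] markers that don't map to any context chunk."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_context_chunks)

# [1] is valid against a 5-chunk context; [7] is hallucinated.
bad = ungrounded_citations("Refunds take 5 days [1], see also [7].", 5)
```

A nonempty result is cheap to alert on in production, which makes this one of the few faithfulness signals you get for free on every request.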

65.6 The build sequence

A practical build sequence for a new RAG system:

Week 1: Get something working.

  • Pick a small corpus.
  • Use LangChain or LlamaIndex with default settings.
  • Recursive chunker, BGE-large-en, FAISS, no reranker, GPT-3.5 or similar.
  • Hardcoded system prompt.
  • Just get end-to-end working with a few queries.

  Week 1 (baseline: working e2e)
    → Week 2 (build eval: golden set)
    → Week 3 (improve retrieval: hybrid + rerank)
    → Week 4 (improve generation: stronger LLM, HyDE)
    → Week 5 (operationalize: monitoring, deploy)
    → Week 6+ (iterate: watch prod)

Build eval before optimizing — the golden set in Week 2 is what tells you whether Weeks 3 and 4 actually improved anything.

Week 2: Build evaluation.

  • Curate a golden set of 30-50 questions with expected answers.
  • Run the system; measure quality.
  • Don’t try to improve yet; just get a baseline.

Week 3: Improve retrieval.

  • Add hybrid retrieval (BM25 + dense).
  • Add reranking (bge-reranker-large).
  • Tune chunking (recursive with overlap, maybe parent-document).
  • Re-run eval; verify improvement.
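
The baseline chunker being tuned here can be sketched in a few lines (character-based for brevity; real splitters count tokens and respect separators like paragraphs and sentences):

```python
def chunk_text(text, size=200, overlap=50):
    """Fixed-size chunks; consecutive chunks share `overlap` characters."""
    assert 0 <= overlap < size
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap   # step forward by size minus overlap
    return chunks

chunks = chunk_text("a" * 500, size=200, overlap=50)
```

The overlap is what keeps a fact that straddles a boundary retrievable from at least one chunk; it is the first knob to tune when eval shows near-miss retrievals.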

Week 4: Improve generation.

  • Switch to a stronger LLM if needed.
  • Tune the system prompt.
  • Add conversational rewriting.
  • Add HyDE if quality is still suffering.
  • Re-run eval.

Week 5: Operationalize.

  • Deploy to a production environment.
  • Add monitoring (Prometheus, Grafana).
  • Add alerting on quality regressions.
  • Run a canary against real traffic.

Week 6+: Iterate.

  • Watch production metrics.
  • Add new failure modes to the golden set as you discover them.
  • Periodically re-tune the configuration.

This sequence prioritizes getting something working before optimizing. The biggest mistake is to spend weeks on the perfect chunking strategy before validating that the rest of the pipeline works.

65.7 Scaling considerations

As you scale, the considerations shift:

Small scale (under 100k documents, low QPS):

  • Single-node setup is fine.
  • Use Qdrant or LanceDB locally.
  • One-shot indexing.

Medium scale (100k-10M documents, modest QPS):

  • Distributed Qdrant or managed (Pinecone) for the vector DB.
  • Elasticsearch for BM25.
  • TEI on a few GPUs for embedding and reranking.
  • vLLM for generation, with continuous batching.
  • Caching for common queries.
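
The query cache can start as an in-process LRU keyed on normalized query text (a sketch; production caches also bound entry age and must be keyed per tenant to avoid the leakage described in §65.4):

```python
from collections import OrderedDict

class QueryCache:
    """Size-bounded LRU cache of final answers for repeated queries."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self._store = OrderedDict()

    @staticmethod
    def _key(query):
        return " ".join(query.lower().split())   # normalize case/whitespace

    def get(self, query):
        key = self._key(query)
        if key in self._store:
            self._store.move_to_end(key)          # mark as recently used
            return self._store[key]
        return None

    def put(self, query, answer):
        self._store[self._key(query)] = answer
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)       # evict least recently used

cache = QueryCache()
cache.put("Refund  Policy?", "Refunds take 5 days.")
hit = cache.get("refund policy?")
```

Even a crude cache like this pays off because query distributions are heavily skewed: a small head of popular questions accounts for a large share of traffic.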

Large scale (10M+ documents, high QPS):

  • Sharded vector DB.
  • Sharded BM25 index.
  • Multi-replica TEI fleet.
  • Multi-replica vLLM fleet with autoscaling.
  • Cross-replica KV cache sharing (Chapter 53).
  • Comprehensive monitoring and alerting.

Frontier scale (100M+ documents, high QPS):

  • Custom vector index, possibly disk-based with NVMe.
  • Distributed retrieval across many shards.
  • Aggressive caching at every layer.
  • Dedicated ops team.

For most teams, medium scale is where you live. The architecture in §65.1 is designed for it.

65.8 The Part IV capstone

This closes Part IV — Information Retrieval & RAG. You now have the full picture:

  • Chapter 57: BM25 and lexical retrieval as the foundation.
  • Chapter 58: dense retrieval and embedding model training.
  • Chapter 59: vector index internals (HNSW, IVF, FAISS).
  • Chapter 60: hybrid retrieval and RRF fusion.
  • Chapter 61: chunking strategies and the parent-document pattern.
  • Chapter 62: cross-encoder reranking.
  • Chapter 63: query rewriting (HyDE, conversational, decomposition).
  • Chapter 64: RAG evaluation and golden sets.
  • Chapter 65 (this chapter): putting it all together end-to-end.

You should be able to:

  • Design a RAG system for any reasonable use case.
  • Pick components based on the workload.
  • Build an evaluation pipeline that catches real failures.
  • Diagnose RAG quality issues from the metrics.
  • Scale a RAG system to production traffic.

In Part V we move from retrieval to agents — the systems that go beyond single-query retrieval to multi-step reasoning, tool use, and decision-making.

65.9 The mental model

Eight points to take into Part V:

  1. A RAG system is two pipelines: indexing (offline) and query (online).
  2. The reference architecture: parse → chunk → embed → index, then retrieve → fuse → rerank → generate.
  3. Build incrementally, test each stage before adding the next.
  4. Build evaluation early. Without eval, you can’t tell if you’re improving.
  5. Hybrid retrieval + reranking is the modern default for the retrieval side.
  6. Conversational rewriting + HyDE are the most useful query-side techniques.
  7. Operational discipline matters more than any single component choice.
  8. Most RAG quality is in the chunking and retrieval, not the LLM.

In Part V — Agents, Tool Use, and Workflow Orchestration — we shift from retrieval to action.


Read it yourself

  • LangChain and LlamaIndex documentation — both have end-to-end RAG tutorials.
  • The LangSmith documentation on RAG evaluation.
  • The Pinecone learning resources on RAG architecture.
  • Building LLM Applications for Production posts by Chip Huyen.
  • The Anthropic / OpenAI cookbook examples for RAG.

Practice

  1. Sketch the architecture for a RAG system serving a customer support knowledge base. Identify each component and your choice for it.
  2. Build a minimal RAG system in 100 lines of Python: parse a few PDFs, chunk, embed (with sentence-transformers), retrieve (FAISS), generate (with the OpenAI API).
  3. Curate a 20-question golden set for the system you built. Run it and report results.
  4. Identify three failure modes in your RAG system. How would you fix each?
  5. Design the indexing pipeline for a RAG system over 1M Wikipedia articles. What’s the chunking? What’s the embedder?
  6. Why is “build something working before optimizing” the right approach? Argue.
  7. Stretch: Build a complete production-grade RAG system with all the components in §65.1. Run a real evaluation and iterate until you reach a quality threshold.