RAG evaluation: Ragas, LLM-as-judge, golden sets
“How do you know your RAG system is working? You don’t, unless you measure. And measuring is hard.”
In Chapter 20 we covered the eval crisis for LLMs in general. RAG systems have all of those problems plus their own: you’re now evaluating two things — the retrieval and the generation — and they interact in subtle ways.
This chapter is about how to evaluate RAG systems honestly. By the end you’ll know:
- The four metrics that matter most for RAG.
- How Ragas and similar tools work.
- Why end-to-end evaluation isn’t enough.
- How to build a golden set for your specific RAG application.
- The trade-offs between automated and human evaluation.
Outline:
- The two-component evaluation problem.
- Retrieval-side metrics.
- Generation-side metrics.
- End-to-end metrics.
- Ragas — the framework.
- LLM-as-judge for RAG.
- Golden sets for RAG.
- The faithfulness problem.
- Production monitoring.
64.1 The two-component evaluation problem
A RAG system has two main components, each with its own quality signal:
- The retriever: did it find the right documents?
- The generator (LLM): did it produce a correct answer given the documents?
A bad retriever + good LLM = wrong answer (the LLM has no information to work with). A good retriever + bad LLM = wrong answer (the LLM has the right context but doesn’t use it). A good retriever + good LLM = right answer.
Evaluating end-to-end (“did the user get the right answer?”) tells you whether the system works overall. But it doesn’t tell you which component is failing when it doesn’t. To debug and improve, you need separate metrics for each component.
This is the two-component evaluation problem. The right approach is to measure both component-level metrics and end-to-end metrics.
64.2 Retrieval-side metrics
The classical IR metrics apply directly:
Precision@K: of the top-K retrieved documents, what fraction are relevant?
Recall@K: of all the relevant documents in the corpus, what fraction are in the top-K?
MRR (Mean Reciprocal Rank): average of 1/rank_of_first_relevant_document across queries.
nDCG@K: normalized Discounted Cumulative Gain — weights relevance by position, so getting the right answer at rank 1 is worth more than at rank 10.
For RAG specifically, the most useful metric is context recall: of the information needed to answer the question, what fraction is in the retrieved context? This is harder to measure than classical recall because “information needed to answer” is subjective.
A practical approach: have a labeled dataset where each query is annotated with the document IDs (or text spans) that contain the answer. Then context recall = (retrieved documents that contain the answer) / (total documents that contain the answer).
For most production RAG, the most important retrieval metric is whether the answer is in the top-K. If the answer isn’t in the retrieved context, the LLM has no chance of producing it correctly. Track this as a binary “answer in context: yes/no” metric.
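The classical metrics above are each a few lines of Python. A minimal sketch, assuming relevance labels are sets of document IDs (the function names are mine, not from any library):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of the top-k retrieved documents, what fraction are relevant?"""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of all relevant documents, what fraction appear in the top-k?"""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(all_retrieved)

def answer_in_context(retrieved: list[str], answer_docs: set[str]) -> bool:
    """The binary 'answer in top-K' signal: is any answer-bearing doc retrieved?"""
    return any(doc in answer_docs for doc in retrieved)
```

The binary `answer_in_context` signal is the one to watch first: if it is False, no generation metric can rescue the answer.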
64.3 Generation-side metrics
Once the retriever has done its job, the generator’s job is to produce a correct answer using the retrieved context. The generation metrics are:
Faithfulness: does the answer stay within what the context says? Does it avoid making things up that aren’t in the context?
Answer relevance: is the answer actually responsive to the question?
Correctness: is the answer factually correct?
These three are different and they all matter. An answer can be:
- Faithful but wrong (the context was wrong, the answer correctly summarizes the wrong context).
- Correct but unfaithful (the LLM ignored the context and used its training data, getting the right answer but not because of RAG).
- Relevant but vague (the answer addresses the question but doesn’t actually answer it).
- Faithful, correct, and relevant (the goal).
For RAG, faithfulness is the most important generation metric because it measures whether RAG is actually working. If the answer is correct but not faithful, you don’t need RAG — the LLM already knew the answer. If the answer is faithful but wrong, the retrieval failed.
Faithfulness can be measured by an LLM-as-judge: give the judge the question, the retrieved context, and the answer, and ask “is every claim in the answer supported by the context?”
64.4 End-to-end metrics
The metric the user actually cares about: did they get a good answer?
For tasks with a known correct answer (factual Q&A), this is straightforward — compare the answer to the ground truth. For open-ended tasks (summarization, advice, explanation), it’s harder; you need either human evaluation or LLM-as-judge.
The standard end-to-end metrics:
Exact match: did the answer exactly match the expected answer? Useful for short factual answers (“What year was Washington born?”). Less useful for long-form answers.
F1 score on tokens: token-level overlap between the predicted answer and the expected answer. Used in older QA benchmarks (SQuAD).
BLEU / ROUGE: surface-form overlap metrics borrowed from machine translation and summarization. Crude but cheap.
LLM-as-judge correctness: have an LLM compare the predicted answer to the expected answer and score it on a 1-5 scale.
Human evaluation: the gold standard. Have humans read the answers and rate them.
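Exact match and token F1 are cheap to compute. A sketch in the spirit of the SQuAD scoring script, with a simplified normalization of my own (lowercase, strip punctuation and articles):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, truth: str) -> bool:
    return normalize(prediction) == normalize(truth)

def token_f1(prediction: str, truth: str) -> float:
    """Token-level F1: harmonic mean of token precision and recall."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

These work for short factual answers; for long-form answers their scores become noise, which is why LLM-as-judge takes over there.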
For most production RAG, LLM-as-judge correctness on a curated golden set is the practical default. It’s automated, cheap, and correlates reasonably with human judgment.
64.5 Ragas — the framework
Ragas (Retrieval-Augmented Generation Assessment, Exploding Gradients, 2023) is the most popular open-source RAG evaluation framework. It implements a suite of metrics:
- Faithfulness: are claims in the answer supported by the context?
- Answer relevance: how relevant is the answer to the question?
- Context precision: how much of the retrieved context is relevant?
- Context recall: how much of the needed information is in the retrieved context?
- Context utilization: did the LLM actually use the retrieved context?
- Answer correctness: how factually correct is the answer (against a ground truth)?
Each metric is computed using LLM-as-judge — Ragas constructs a specific prompt for each metric and scores the result.
A typical Ragas evaluation (the classic 0.1-era API; newer Ragas versions wrap the data in an `EvaluationDataset` instead, so check the docs for your version):

```python
from datasets import Dataset  # Ragas consumes a HuggingFace Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

dataset = Dataset.from_dict({
    "question": [...],
    "answer": [...],        # the model's answer
    "contexts": [...],      # the retrieved contexts, a list of strings per question
    "ground_truth": [...],  # the expected answer
})

results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
```
The output is a score per metric per example, plus aggregate scores.
Ragas is the de facto standard for RAG evaluation in 2024-25. It’s not perfect (LLM-as-judge has its own biases) but it’s useful as a starting point.
Other frameworks: TruLens, DeepEval, promptfoo, LangSmith eval — they all implement similar metrics with slightly different prompts and aggregation. The differences matter less than the fact that you’re measuring something.
64.6 LLM-as-judge for RAG
The LLM-as-judge pattern (Chapter 20) applied to RAG:
For each (question, context, answer, ground_truth) tuple, prompt an LLM to score the answer:
Prompt:
You are evaluating a RAG system. Given the following question, retrieved
context, and generated answer, rate the answer on:
1. Faithfulness (1-5): are all claims supported by the context?
2. Relevance (1-5): does the answer address the question?
3. Correctness (1-5): is the answer factually correct?
Question: {question}
Context: {context}
Answer: {answer}
Ground truth: {ground_truth}
Provide scores in JSON format with brief explanations.
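The judge's reply then has to be parsed defensively, because models sometimes wrap the JSON in prose or code fences. A minimal sketch (the score schema matches the prompt above; the extraction strategy is my own, not from any framework):

```python
import json
import re

EXPECTED_KEYS = {"faithfulness", "relevance", "correctness"}

def parse_judge_scores(raw: str) -> dict[str, int]:
    """Extract a JSON object from the judge's reply and validate the 1-5 scores."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # grab from first { to last }
    if match is None:
        raise ValueError("no JSON object found in judge output")
    scores = json.loads(match.group(0))
    result = {}
    for key in EXPECTED_KEYS:
        value = int(scores[key])
        if not 1 <= value <= 5:
            raise ValueError(f"{key} score {value} outside the 1-5 scale")
        result[key] = value
    return result
```

Validating the scale matters in practice: judges occasionally return 0 or 10 despite the prompt, and silently averaging those in corrupts the aggregate.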
The judge LLM (typically a strong one like GPT-4 or Claude) reads the inputs and produces structured scores.
Strengths of LLM-as-judge for RAG:
- Automated, scales to thousands of examples.
- Captures nuance that exact-match metrics miss.
- Multi-dimensional: can score faithfulness, relevance, correctness separately.
- Cheap compared to human evaluation.
Weaknesses (the same as Chapter 20):
- Position bias if comparing two answers.
- Length bias preferring longer answers.
- Self-preference when judge and answer come from the same model family.
- Inconsistency across runs (the judge’s scores have variance).
Mitigations:
- Use multiple judges of different families and average.
- Run each judgment twice with answer order swapped, average.
- Calibrate against human eval periodically to verify the scores correlate.
For most production RAG, LLM-as-judge with one strong judge model + periodic human spot checks is the practical approach. Don’t trust any single judge call; aggregate over many examples.
64.7 Golden sets for RAG
The most actionable evaluation tool: a golden set of representative test cases.
For RAG specifically, the golden set is a curated collection of (question, expected_answer) pairs (and optionally, expected_relevant_documents). The questions cover the diversity of real user queries; the expected answers are vetted as correct.
Building a RAG golden set:
- Sample real user queries from production logs (anonymized).
- Curate 50-200 representative ones covering different topics, difficulty levels, and intents.
- For each query, write the ideal answer based on the actual content of your corpus.
- For each query, label the documents that contain the answer (this gives you the retrieval ground truth).
- Run your RAG system on the golden set and compute metrics.
The golden set is your ground truth. You compare candidate RAG configurations against it. When you change the embedder, the chunking, the reranker, or the LLM, run the eval on the golden set and see if the metrics improve.
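The evaluation loop itself is simple. A sketch assuming each golden example carries `question` and `answer_docs` labels, and a `rag_system(question)` callable that returns the answer plus the retrieved document IDs (both names are placeholders for your own code):

```python
def eval_golden_set(golden_set: list[dict], rag_system) -> dict:
    """Run the RAG system over the golden set and aggregate the retrieval
    hit rate (answer-bearing doc in top-K), keeping per-example records."""
    hits = 0
    records = []
    for example in golden_set:
        answer, retrieved_ids = rag_system(example["question"])
        hit = any(doc in example["answer_docs"] for doc in retrieved_ids)
        hits += hit
        records.append({
            "question": example["question"],
            "answer": answer,
            "retrieval_hit": hit,
        })
    return {"retrieval_hit_rate": hits / len(golden_set), "records": records}
```

In a real pipeline you would also feed each record through the LLM-as-judge scorer from 64.6; keeping the per-example records is what lets you diff two configurations question by question.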
The golden set should evolve over time:
- Add new examples as your understanding of the use case grows.
- Remove or update outdated examples as the corpus changes.
- Add adversarial examples that probe for known failure modes.
The golden set is the single most important eval artifact for production RAG. Build one early, maintain it carefully, and trust it more than any benchmark.
64.8 The faithfulness problem
The deepest challenge in RAG evaluation: how do you tell whether the LLM is using the context or not?
A model might answer correctly because:
- It used the context. RAG is working.
- It already knew the answer from training. RAG is unnecessary.
- It guessed correctly. RAG is broken but lucky.
Without isolation, you can’t tell the difference. The way to test:
Test 1: Ablation. Run the same questions with and without retrieved context. If the answers are the same, the retrieved context didn’t matter (case 2 or 3).
Test 2: Adversarial context. Inject misleading context and see if the model follows it. If the model produces the truth despite misleading context, it’s relying on training data (not the context).
Test 3: Out-of-distribution questions. Ask questions whose answers are NOT in the training data but ARE in the corpus. If the model answers correctly, it must have used the context.
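The ablation logic from Test 1 can be made concrete. A sketch that classifies each question by comparing the with-context and without-context answers against the ground truth (a crude normalized-equality check stands in for a real correctness judge):

```python
def classify_rag_contribution(with_ctx: str, without_ctx: str, truth: str) -> str:
    """Ablation: did the retrieved context change the outcome?"""
    correct_with = with_ctx.strip().lower() == truth.strip().lower()
    correct_without = without_ctx.strip().lower() == truth.strip().lower()
    if correct_with and not correct_without:
        return "rag_contributing"   # context turned a wrong answer right
    if correct_with and correct_without:
        return "rag_unnecessary"    # model knew it anyway (or guessed twice)
    if correct_without:
        return "rag_harmful"        # context made the answer worse
    return "both_wrong"             # retrieval or generation is failing
```

Run this over the golden set and look at the distribution: a system where almost everything lands in `rag_unnecessary` is the "window dressing" failure mode described below.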
For most RAG applications, the OOD test is the most useful. Pick questions where the answer is definitely in your corpus but not in the LLM’s training data (e.g., questions about your company’s internal policies, recent events post-training-cutoff, or specific document content). Test the model on these to verify RAG is actually contributing.
If the OOD test fails (the model answers incorrectly), RAG isn’t working. If it passes, RAG is contributing.
This is a critical test that most teams skip. The result is they think RAG is working when it isn’t — they’re really just relying on the LLM’s training data to answer questions, and the retrieval is window dressing.
64.9 Production monitoring
Beyond pre-deployment evaluation, production RAG needs continuous monitoring. The metrics to track:
Retrieval metrics (per-request):
- Number of candidates retrieved.
- Average and max retrieval latency.
- Distribution of retrieval scores (are they well-distributed or all clustered?).
- Cache hit rate (for prefix caching, embedding caching).
Generation metrics (per-request):
- TTFT and TPOT (Chapter 31).
- Output length distribution.
- Refusal rate (“I don’t know” responses).
Quality metrics (sampled):
- Periodically run LLM-as-judge on a sample of production responses.
- Track faithfulness, relevance, correctness scores over time.
- Alert on regressions.
User feedback metrics:
- Thumbs-up / thumbs-down ratings.
- Click-through rate on cited sources.
- Conversation length (longer might mean dissatisfaction).
These give you a continuous picture of RAG quality. Combined with the golden set evaluation (run before each deployment), you have a reasonable monitoring story.
The hardest part is the quality metrics — they’re noisy and hard to interpret. The pragmatic approach is to track them as trend lines and alert on large changes, not on absolute thresholds.
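Trend-line alerting can be as simple as comparing a recent window of judge scores against a longer baseline window. A minimal sketch (the window sizes and the 10% relative-drop threshold are arbitrary starting points, not recommendations):

```python
def trend_alert(scores: list[float], recent: int = 50, baseline: int = 500,
                max_drop: float = 0.10) -> bool:
    """Alert if the mean of the last `recent` scores drops more than
    `max_drop` relative to the mean of the preceding `baseline` scores."""
    if len(scores) <= recent:
        return False  # not enough data to form both windows
    recent_mean = sum(scores[-recent:]) / recent
    baseline_scores = scores[-(recent + baseline):-recent]
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    if baseline_mean == 0:
        return False
    return (baseline_mean - recent_mean) / baseline_mean > max_drop
```

Comparing windows against each other, rather than against a fixed threshold, is exactly the "trend lines, not absolute thresholds" advice above: it tolerates a noisy judge while still catching regressions.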
64.10 The mental model
Eight points to take into Chapter 65:
- RAG has two components to evaluate: retrieval and generation. Measure each separately plus end-to-end.
- The most important retrieval metric: is the answer in the top-K candidates?
- The most important generation metric: faithfulness — does the answer stay within the context?
- Ragas is the standard open-source RAG evaluation framework.
- LLM-as-judge is the practical default for automated evaluation. Has biases; use multiple judges.
- Golden sets are the most actionable evaluation tool. Build one early.
- The faithfulness problem: test that the LLM is actually using the context, not just relying on training data.
- Production monitoring: continuous quality tracking + alerting on regressions.
In Chapter 65 we close out Part IV by putting all the RAG pieces together end-to-end.
Read it yourself
- The Ragas documentation and GitHub repository.
- Es et al., Ragas: Automated Evaluation of Retrieval Augmented Generation (2023).
- The LangSmith eval documentation.
- The TruLens documentation.
- Narayanan & Kapoor, Evaluating LLMs Is a Minefield (talk; any of the recent blog posts on the topic cover similar ground).
Practice
- Build a tiny golden set of 10 questions for a RAG system you’ve built (or imagined). Include questions and expected answers.
- Why is faithfulness more important than correctness for evaluating RAG? Make the argument.
- Run Ragas on a small RAG output dataset and interpret the scores.
- Design an OOD test for a RAG system you’re building. What questions would you use?
- Why does LLM-as-judge have position bias? How do you mitigate it?
- For a production RAG system, design a monitoring dashboard. What metrics would you display?
- Stretch: Build a RAG evaluation pipeline that runs the golden set on every commit. Set up alerting on significant metric regressions.