Part II · Training, Fine-Tuning, Alignment
Chapter 20 · Core · ~24 min read

Evaluation: the hardest unsolved problem in ML

"How do you know your new model is better?"

This is the closing chapter of Part II, and it’s the most important one to internalize for production work. Every other chapter in this part — pretraining, fine-tuning, RLHF, distillation — depends on evaluation to know whether the change was an improvement. If your eval is broken, every decision downstream is broken. And evaluations of LLMs are, frankly, broken in interesting ways that the field has not figured out how to fix.

This chapter is about how to evaluate LLMs honestly. By the end you will know:

  • The four families of LLM evaluation and when to use each.
  • The contamination problem and why benchmark scores are upper bounds.
  • The LLM-as-judge pattern, including its biases.
  • How to build a golden set that actually catches regressions in your use case.
  • Why arena-style human comparison is the most contamination-resistant signal we have.
  • How to talk about evaluation in interviews without sounding naive.

Outline:

  1. The fundamental hardness of LLM evaluation.
  2. Loss vs accuracy vs quality.
  3. Standard benchmarks: MMLU, GPQA, HumanEval, BBH, MT-Bench, Arena-Hard, IFEval.
  4. lm-eval-harness and the framework problem.
  5. Benchmark contamination — what it is, how widespread, how to detect.
  6. LLM-as-judge — the dominant pattern and its biases.
  7. Vibe checks and golden sets — the unscalable but essential piece.
  8. The arena leaderboard — the contamination-resistant gold standard.
  9. Building an internal eval suite.
  10. The eval crisis and where the field is heading.

20.1 The fundamental hardness

Why is LLM evaluation hard? Because the things you actually want to measure — “helpfulness,” “honesty,” “harmlessness,” “good writing,” “correct reasoning” — are subjective, multi-dimensional, and context-dependent. There is no ground truth for “this is a better explanation of photosynthesis than that one.” There are partial proxies (length, citation accuracy, technical depth), but none of them capture the whole.

This is a sharp departure from traditional ML evaluation. For image classification, you compute accuracy on a held-out test set with known labels. For machine translation, you compute BLEU against reference translations. For speech recognition, you compute word error rate. These are imperfect but bounded — you can argue about whether BLEU correlates with quality, but you can compute it deterministically.

LLM evaluation has none of this nice structure. The “correct answer” is “any of many plausible answers, with the better ones distinguished by qualities that are hard to operationalize.” So the field has converged on a few imperfect approximations that, taken together, give you a fuzzy picture of where a model stands.

The four families:

  1. Standard benchmarks with closed-form answers (multiple choice, code execution, math). Easy to score, easy to game, contaminated.
  2. LLM-as-judge benchmarks (MT-Bench, Arena-Hard). Score open-ended responses with a bigger model. Cheap, biased.
  3. Human pairwise comparison (LMSYS Arena, internal red teams). Most reliable, slowest and most expensive.
  4. Internal task-specific evals (golden sets, regression suites). Most actionable for production work, narrowest in scope.

A serious eval strategy uses all four. A bad eval strategy uses one and trusts it.

Figure: the four eval families plotted on cost vs. contamination resistance — benchmarks like MMLU are cheap but contaminated, LLM-as-judge sits in between, golden sets are the production sweet spot, and the arena is slow but the gold standard.
No single eval method sits in the upper-right corner — a production eval strategy must combine all four, using benchmarks for quick regressions and the arena for ground truth.

20.2 Loss vs accuracy vs quality

Three different things you might measure during training, and what each one actually tells you:

Training loss / validation loss. The cross-entropy of the model’s predictions against the training data. Decreasing loss means the model is getting better at the training distribution. It does not mean the model is getting better at the task you care about. Validation loss is a useful sanity check that something is improving, but you cannot ship “lower validation loss” as a quality signal.

Per-task accuracy on a benchmark. The fraction of benchmark questions the model gets right. Better than loss because it measures task performance directly, but inherits all the problems of the benchmark (contamination, narrow coverage, etc.).

End-to-end quality on your actual use case. The thing you actually care about. Much harder to measure but the only signal that matters for production. Always the “ground truth” in any disagreement between metrics.

The senior engineering move is to always check all three when comparing models and to weight them roughly in order: end-to-end quality > task accuracy > validation loss. If validation loss says model A is better but task accuracy says model B is better, you ship model B. If task accuracy disagrees with end-to-end quality, you investigate why and trust the end-to-end measurement.

20.3 Standard benchmarks

The benchmark zoo, in roughly the order you’ll encounter them:

MMLU — Massive Multitask Language Understanding (Hendrycks et al., 2021)

The benchmark you see on every model card. ~16k multiple-choice questions across 57 academic subjects: math, history, computer science, medicine, law, philosophy, etc. Scored in zero-shot or 5-shot mode. The headline number for general knowledge.

Strengths: broad coverage, easy to score, the standard.

Weaknesses: massively contaminated. Most modern models have seen the MMLU questions during training (it’s been on the open web since 2021). Recent strong scores are inflated. The 5-shot variant is more contamination-prone than the 0-shot.

A score of “85 on MMLU” used to mean “frontier-class.” Now it means “this model is in the same ballpark as anything modern.” MMLU is past its useful life as a primary benchmark.

MMLU-Pro (TIGER-Lab, 2024)

A harder variant of MMLU with more answer choices (10 instead of 4), more reasoning-heavy questions, and an attempt to filter out the contaminated easy questions. Currently more useful than original MMLU.

GPQA — Google-Proof Question Answering (Rein et al., 2023)

Graduate-level science questions specifically designed to be too hard to look up on Google (and by extension, hard to memorize from web training). The questions were written by domain experts and verified by other domain experts. ~448 questions in physics, chemistry, biology.

GPQA is one of the more contamination-resistant benchmarks because the questions are deliberately obscure. Scores in late 2025 are around 60% for the strongest models — compared with 25% for random guessing on four choices and roughly 65–75% for PhD-level domain experts in the original study. Strong signal for reasoning ability.

HumanEval — Code completion (Chen et al., 2021)

164 hand-written programming problems where the model has to produce a function body given a docstring. Scored by running unit tests against the generated code. This is one of the cleanest evals in the field because the verification is automated and objective: either the code passes the tests or it doesn’t.

Frontier models score 80–95% on HumanEval. The benchmark is mostly saturated; harder variants exist (HumanEval+, MBPP, LiveCodeBench).
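The execution-based scoring that makes HumanEval clean can be sketched in a few lines. This is a toy sketch only — real harnesses run each sample in a sandboxed subprocess with a timeout, never a bare `exec()` of untrusted model output in-process:

```python
def passes_tests(generated_code: str, test_code: str) -> bool:
    """HumanEval-style functional scoring: execute the generated function,
    then its unit tests; the sample passes iff nothing raises.
    WARNING: illustrative only -- model output must be sandboxed."""
    namespace = {}
    try:
        exec(generated_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # run the benchmark's assertions
        return True
    except Exception:
        return False
```

Aggregating this pass/fail signal over the problem set (and over multiple samples per problem, for pass@k) gives the reported score.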

MBPP — Mostly Basic Python Problems (Austin et al., 2021)

Like HumanEval but with ~1000 simpler problems. Often used together with HumanEval as a code-generation benchmark pair.

BBH — BIG-Bench Hard (Suzgun et al., 2022)

A subset of the BIG-Bench benchmark consisting of 23 tasks where the models of 2022 performed worst. Includes logical reasoning, multi-step reasoning, and various challenge tasks. Useful for measuring the “reasoning” axis specifically.

MT-Bench (LMSYS, 2023)

An open-ended chat benchmark: 80 multi-turn questions covering writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. Scored by GPT-4 as a judge (LLM-as-judge — more on this in §20.6). Each response gets a score from 1–10.

MT-Bench was the first widely-adopted LLM-as-judge benchmark. It correlates reasonably with arena rankings and is much cheaper than human eval. Mostly saturated for the strongest models — top scores are around 8.5/10, and the differences between top models are within the noise.

Arena-Hard (LMSYS, 2024)

A successor to MT-Bench using harder prompts (sourced from the LMSYS arena’s actual user queries) and an LLM-as-judge with stronger anti-bias techniques. Currently one of the better LLM-as-judge benchmarks.

IFEval — Instruction Following Eval (Zhou et al., 2023)

Tests whether the model follows specific instructions about format, length, language, etc. Each prompt has verifiable instructions (“respond in exactly 3 paragraphs,” “include the word ‘banana’ three times,” “respond in JSON”). The scoring is automated. Useful for measuring instruction-following discipline specifically, which is hard to test with knowledge benchmarks.
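Checks of this kind are simple to automate. A minimal sketch in the spirit of IFEval — the three checks below are illustrative examples, not IFEval's actual rule set:

```python
import json
import re

def check_instructions(response: str) -> dict:
    """Verifiable instruction checks: each one is decidable with plain
    string/regex logic, no judge model needed. Illustrative rules only."""
    checks = {}
    # "respond in exactly 3 paragraphs" (paragraphs separated by blank lines)
    paragraphs = [p for p in response.split("\n\n") if p.strip()]
    checks["exactly_3_paragraphs"] = len(paragraphs) == 3
    # "include the word 'banana' three times" (case-insensitive, whole word)
    checks["banana_x3"] = len(re.findall(r"\bbanana\b", response, re.I)) == 3
    # "respond in JSON"
    try:
        json.loads(response)
        checks["valid_json"] = True
    except ValueError:
        checks["valid_json"] = False
    return checks
```

Because every check is deterministic, the benchmark score is fully reproducible — which is exactly what makes instruction-following evals trustworthy.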

Less-common but important

  • MATH (Hendrycks et al., 2021): 12.5k math competition problems. Hard.
  • GSM8K (Cobbe et al., 2021): 8.5k grade-school math word problems. Mostly saturated.
  • HellaSwag, ARC, Winogrande: classical NLU benchmarks, mostly saturated.
  • TruthfulQA: tests whether the model avoids common misconceptions and imitative falsehoods rather than confidently repeating them.
  • RULER (Hsieh et al., 2024): long-context retrieval and reasoning. Hard.

Each benchmark measures one slice of capability. The mistake is to look at any single number and call a model “better.” The right approach is to look at a basket of benchmarks covering different capabilities, plus arena scores, plus your own internal evals.

20.4 lm-eval-harness and the framework problem

Running benchmarks isn’t just “load the data and call the model.” Each benchmark has specific scoring conventions, prompt formats, few-shot example selection, and answer extraction logic. Two implementations of “the MMLU benchmark” can produce different scores for the same model because of trivial differences in formatting.

lm-evaluation-harness (EleutherAI) is the most widely-used standardized framework. It implements ~200 benchmarks with consistent prompt formatting and scoring, so you can compare results across models without worrying about implementation drift. Most published benchmark scores in 2023–2024 come from lm-eval-harness.

The framework problem: even small differences in prompt formatting can move scores by several percentage points. The famous example is the choice of how to compare answer choices for multiple-choice questions:

  • Per-token comparison. For each answer choice, compute the model’s log-probability of the choice’s tokens. Pick the choice with highest log-prob. This is what lm-eval-harness does for most multiple-choice tasks.
  • Generation comparison. Have the model generate a free-form answer, then parse out which choice it picked. More natural but slower and noisier.

The two methods can produce different rankings. A model that scores 70% under per-token comparison might score 65% under generation comparison. If you compare two models using different methods, you’re not comparing the same thing.
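The per-token method can be sketched as follows, with `logprob_fn` standing in for a real model call (e.g. summing token log-probabilities from a causal LM):

```python
def score_multiple_choice(logprob_fn, question, choices):
    """Per-token comparison: score each answer choice by the model's total
    log-probability of the choice's tokens given the question, and pick
    the argmax. `logprob_fn(context, continuation) -> float` is a stand-in
    for a real model call."""
    scores = [logprob_fn(question, c) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])

def score_multiple_choice_normalized(logprob_fn, question, choices):
    """Length-normalized variant: longer choices accumulate more negative
    log-prob, so harnesses often also report a per-character (or per-byte)
    normalized score alongside the raw one."""
    scores = [logprob_fn(question, c) / max(len(c), 1) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])
```

Generation comparison instead samples a free-form answer and parses the chosen letter out of it — a different measurement entirely, which is why the two methods can disagree.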

The discipline: when comparing model scores, always check that they were measured the same way. Always be skeptical of “the new model scores X” claims that aren’t accompanied by the exact eval framework, version, and configuration.

20.5 Benchmark contamination

The biggest problem with all of the standard benchmarks: the model has probably seen the answers. The mechanism:

  1. The benchmark is published on the open web (an arXiv paper, a GitHub repo, a HuggingFace dataset).
  2. Common Crawl scrapes it.
  3. The next round of pretraining includes Common Crawl.
  4. The model sees the questions and answers during pretraining, memorizes them, and “scores high” by recall rather than capability.

This is benchmark contamination, and it’s pervasive. Studies in 2023–2024 found some degree of contamination on essentially every popular benchmark across most modern open LLMs. The contamination is sometimes deliberate (a lab adds the benchmark to the training mix to inflate scores) but more often accidental (the benchmark is simply on the web, and the lab’s pretraining filter doesn’t catch it).

Detecting contamination:

  • Memorization tests. Show the model the first half of a benchmark question and ask it to complete it. If it can recite the question verbatim, it’s seen it. This is a strong signal.
  • Variant tests. Rephrase the benchmark questions or shuffle the answer choices. If the model’s score drops significantly under the variant, it was relying on memorization.
  • Held-out comparison. Compare the model’s score on the public benchmark to its score on a held-out variant of the same task that wasn’t in any training data. The gap measures contamination.
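The variant test is easy to implement once you have a scoring loop. A sketch, with `predict_fn` standing in for a real model and each question given as (text, choices, correct index):

```python
import random

def variant_test(predict_fn, questions, n_trials=5, seed=0):
    """Variant test for contamination: shuffle the answer choices and
    re-score. A model that memorized the original question (including the
    answer's position or surface form) loses accuracy under shuffling; a
    model that actually knows the answer does not.
    `predict_fn(question, choices) -> chosen index` is a stand-in."""
    rng = random.Random(seed)

    def accuracy(items):
        return sum(predict_fn(q, ch) == ans for q, ch, ans in items) / len(items)

    original = accuracy(questions)
    shuffled = []
    for _ in range(n_trials):
        items = []
        for q, choices, ans in questions:
            perm = list(range(len(choices)))
            rng.shuffle(perm)
            items.append((q, [choices[i] for i in perm], perm.index(ans)))
        shuffled.append(accuracy(items))
    # A large gap between the two numbers is evidence of memorization.
    return original, sum(shuffled) / n_trials
```

A model that always answers "A" because it memorized answer positions scores perfectly on the original set and near chance on the shuffled variants.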

Recent benchmarks try to be contamination-resistant by various means:

  • GPQA has questions deliberately too obscure to be on the web.
  • LiveCodeBench uses competitive programming problems published after the model’s training cutoff.
  • MMLU-Pro rewrote the easier MMLU questions to defeat memorization.
  • Arena-Hard uses real user queries from the LMSYS arena, so the questions don’t exist anywhere until they’re served.

For new benchmarks to avoid contamination, they have to be hidden (which limits reproducibility) or continuously refreshed (which requires a release pipeline). Both approaches are now standard for the more rigorous benchmarks.

20.6 LLM-as-judge

For open-ended generation tasks (chat, writing, reasoning), there’s no automated way to score the output. The dominant compromise: use a stronger LLM as the judge. The pattern:

prompt_template = """You are evaluating two responses to a user question.

User question: {question}

Response A: {response_a}
Response B: {response_b}

Which response is better? Answer with 'A', 'B', or 'tie'.
Then explain your reasoning briefly.
"""

verdict = judge_model.generate(prompt_template.format(
    question=question, response_a=response_a, response_b=response_b))

You then aggregate verdicts over many questions to get a win rate, an Elo score, or some other ranking. MT-Bench, Arena-Hard, and AlpacaEval all use variants of this pattern.

LLM-as-judge is the dominant pattern for evaluating chat models. It’s cheap (a few cents per comparison), scales to thousands of comparisons, and correlates reasonably with human judgment. But it has well-documented biases:

Figure: the LLM-as-judge pipeline — two responses to the same prompt go to a judge model (GPT-4, Claude), which outputs a verdict (A / B / tie); each comparison is run in both orders (A→B and B→A) and averaged to cancel position bias.
LLM-as-judge is cheap and scalable, but position bias, length bias, and self-preference require running every comparison bidirectionally and averaging across judge families before trusting the verdict.
  • Position bias. The judge tends to prefer the first response. Mitigated by running the comparison twice with the order swapped and averaging.
  • Length bias. The judge prefers longer responses. The verbosity bias from RLHF interacts badly with this — RLHF’d models produce longer outputs, which judges prefer, which inflates RLHF scores.
  • Self-preference. When the judge model is from the same family as one of the responses, it tends to prefer that response. (GPT-4 judges prefer GPT-4 outputs; Claude judges prefer Claude outputs.) Mitigated by using a different judge family or by averaging multiple judges.
  • Style preference. Judges have stylistic preferences (markdown formatting, structured headings, polite tone). Models that match the judge’s style score higher even if they’re not actually better.
  • Confidence bias. Judges prefer confident-sounding responses over hedged ones, even when hedging is more accurate.

These biases mean LLM-as-judge benchmarks should not be trusted as the only signal. They’re useful as a cheap directional indicator, but they need to be balanced against human eval and against task-specific evals.

The best mitigations:

  • Use multiple judges of different families. Average GPT-4, Claude, and Llama 3 judges to wash out individual biases.
  • Use position randomization. Always run each comparison both ways and average.
  • Use length-controlled scoring. Some frameworks (e.g. AlpacaEval’s length-controlled win rate) explicitly adjust for response length to remove the verbosity bias.
  • Sanity-check against human eval. Periodically run human evaluation on the same prompts to verify the judge’s rankings hold up.
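The position-swap mitigation can be sketched as a wrapper around a single judge call. This sketch uses one common variant — counting a win only when both orders agree (averaging the two verdicts is the other common choice); `judge_fn` is a stand-in for one judge-model invocation:

```python
def judge_pair(judge_fn, question, resp_a, resp_b):
    """Position-debiased pairwise judgment: run the comparison in both
    orders and only award a win when the two verdicts agree.
    `judge_fn(question, first, second) -> 'first' | 'second' | 'tie'`
    is a stand-in for one call to a judge model."""
    v1 = judge_fn(question, resp_a, resp_b)   # A shown first
    v2 = judge_fn(question, resp_b, resp_a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A"    # A preferred in both orders
    if v1 == "second" and v2 == "first":
        return "B"    # B preferred in both orders
    return "tie"      # order-dependent verdicts cancel to a tie
```

A judge with pure position bias — always preferring whichever response is shown first — scores every pair as a tie under this scheme, instead of systematically favoring one model.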

LLM-as-judge is here to stay. The honest version is “use it carefully, and don’t believe a single number from it without checking the biases.”

20.7 Vibe checks and golden sets

The unscalable but essential piece: manual evaluation. A “vibe check” is when you run a handful of representative prompts through your candidate model and read the outputs yourself. It’s the most expensive form of evaluation per data point, and it’s the only one that catches subtle failures the automated benchmarks miss.

A golden set is the formalized version: a curated list of ~50–500 prompts that represent the things you actually care about, with reference outputs that you trust. You run candidate models on the golden set and either (a) read the outputs manually and rate them, (b) compare them to the reference outputs with an automated metric, or (c) use LLM-as-judge to compare them.

Golden sets are the single most useful eval for production work. They are:

  • Tailored to your use case. Generic benchmarks don’t measure what you actually need. Your golden set does.
  • Contamination-resistant. Your golden set is private to your team, so it can’t be in any model’s training data.
  • Actionable. When you find a regression on the golden set, you know exactly which kinds of inputs broke and you can investigate.
  • Stable. A well-curated golden set changes slowly over time, giving you a reliable signal across model iterations.

Building one is straightforward but takes effort:

  1. Pick 50–500 prompts that cover the things you care about (the breadth depends on how broad your application is).
  2. For each prompt, write or curate a “good” reference answer.
  3. Run your current production model on the golden set. Inspect the outputs. Adjust the prompts or the references where they don’t match what you want.
  4. When evaluating a new candidate model, run it on the same golden set and compare to the production model’s outputs side-by-side.

A team that does this well catches 90% of production-relevant regressions before deployment. A team that doesn’t relies on broad benchmarks and gets blindsided.

The discipline: the golden set is owned by the team, lives in version control, and is updated when your understanding of the use case evolves. It’s not a one-time artifact — it’s a living document of “what good looks like for our model in our use case.”
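A minimal golden-set regression runner might look like this. The JSONL format with `prompt` and `reference` keys is a hypothetical convention for the sketch, not a standard; `check_fn` encodes your pass criterion (exact match, substring, a regex, or a call out to an LLM judge):

```python
import json

def run_golden_set(model_fn, golden_path, check_fn):
    """Minimal golden-set regression runner over a JSONL file of
    {"prompt": ..., "reference": ...} cases (hypothetical format).
    `model_fn(prompt) -> output` is a stand-in for the candidate model."""
    with open(golden_path) as f:
        cases = [json.loads(line) for line in f if line.strip()]
    failures = []
    for case in cases:
        output = model_fn(case["prompt"])
        if not check_fn(output, case["reference"]):
            failures.append(case["prompt"])
    # The failure list, not just the rate, is the actionable part:
    # it tells you exactly which kinds of inputs regressed.
    return 1 - len(failures) / len(cases), failures
```

Run this against both the production model and the candidate, then diff the failure lists — that diff is the regression report.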

20.8 The arena leaderboard

The single most contamination-resistant signal we have for chat-model quality is LMSYS Arena (now called LM Arena). The setup:

  1. Users go to the arena site and type a prompt.
  2. The site sends the prompt to two anonymous models (chosen at random from a pool of dozens of frontier and open models).
  3. The user reads both responses and votes for the better one (or “tie” or “both bad”).
  4. Results are aggregated into Elo ratings via the Bradley-Terry model.

The leaderboard publishes the Elo ratings publicly, updated continuously. It has hundreds of thousands of votes per month and dozens of models in the pool.
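The aggregation step can be sketched with a simple Bradley-Terry fit via iterative scaling. This is a minimal sketch: the arena's actual fit also models ties and reports confidence intervals, and this version assumes every model has at least one win and the comparison graph is connected:

```python
import math

def bradley_terry(votes, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs by iterative
    scaling (minorization-maximization), then map to Elo-style ratings.
    Assumes every model has >= 1 win and a connected comparison graph."""
    models = sorted({m for pair in votes for m in pair})
    wins = {m: 0 for m in models}
    counts = {}                     # comparisons per unordered pair
    for w, l in votes:
        wins[w] += 1
        key = frozenset((w, l))
        counts[key] = counts.get(key, 0) + 1
    p = {m: 1.0 for m in models}    # BT strengths, iteratively refined
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(c / (p[i] + p[next(iter(key - {i}))])
                        for key, c in counts.items() if i in key)
            new[i] = wins[i] / denom
        total = sum(new.values())   # fix the scale (BT is scale-invariant)
        p = {m: v * len(models) / total for m, v in new.items()}
    # Elo-style: 400 rating points per 10x strength ratio, anchored at 1000.
    return {m: 1000 + 400 * math.log10(p[m]) for m in models}
```

Unlike raw win rate, the fitted strengths account for who each model was compared against, so a model that mostly faced strong opponents isn't penalized for its schedule.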

Why it’s the gold standard:

  • The prompts are real. They come from actual users with actual needs, not from a benchmark designer’s imagination.
  • The prompts change continuously. New prompts are added daily, so models can’t memorize them.
  • The judging is human. No LLM-as-judge biases.
  • The model identities are blind. Voters don’t know which model is which, so brand bias is eliminated.
  • It scales. Hundreds of thousands of votes per month gives statistically meaningful rankings.

The arena is the most-trusted signal in the LLM evaluation field. Frontier labs treat their arena ranking as a primary metric.

The criticisms:

  • The user population is biased. Arena users are mostly developers and ML researchers, not the general population. They have specific preferences (coding tasks, technical questions) that don’t represent every use case.
  • The prompts are short. Most arena prompts are a single turn, often just a few hundred tokens. Long-context behavior isn’t tested.
  • Verbosity bias. Even human voters prefer longer responses when both are correct. The “length-controlled” arena variant tries to fix this.
  • Style bias. Voters have stylistic preferences (markdown, structure) that drive results.

These are real issues, but they’re orders of magnitude smaller than the problems with benchmark contamination. Arena scores are the best public signal for chat-model quality available right now.

20.9 Building an internal eval suite

The recommended setup for serious production work, in priority order:

  1. A golden set of 50–500 task-specific prompts with reference outputs and explicit pass/fail criteria. The most actionable signal you have.
  2. A regression suite that runs after every fine-tune, comparing key metrics (golden set scores, a few standard benchmarks, latency, format compliance) against the previous best.
  3. A few standard benchmarks that you trust: GPQA for reasoning, HumanEval for code, IFEval for instruction following, MMLU-Pro for general knowledge. Don’t trust raw MMLU.
  4. An arena-style internal A/B if your scale supports it: deploy two models to a fraction of traffic and let real users vote. This is the closest thing to “ground truth” you can get.
  5. LLM-as-judge on a held-out prompt set as a cheap regression check, with multiple judges and position swapping.
  6. Periodic human eval by domain experts on a subset of the golden set. Once a month, have someone qualified read 50 outputs and rate them. This catches drift that the automated metrics miss.
  7. Production monitoring. Track output quality signals in production: response length, refusal rate, retries, user thumbs-down. These are noisy but real.

Figure: the production eval pipeline — every new model candidate runs the fast, cheap regression suite (golden set + key benchmarks); candidates that pass advance to LLM-as-judge, failures are investigated, and shipped models get monitoring plus periodic human eval.
The regression suite runs on every candidate; only models that pass advance to the expensive human-eval tier — keeping the cost per candidate low while maintaining a high-trust signal for production decisions.

The production eval pipeline runs (1)–(3) on every model candidate, (4)–(5) on every shipped model, and (6)–(7) continuously. The cost of running this is significant — at scale, the eval cluster can be a meaningful fraction of the training cluster — but the cost of skipping it is vastly higher (shipping a bad model with no warning).

20.10 The eval crisis

The deeper problem: none of the metrics in this chapter are great. They’re useful, but they’re a basket of imperfect proxies for “is this model good.” There is no clean answer to “how do I know my new model is better.” The field is in an active eval crisis, and the signals you can publish to your boss about “we improved by X percent” are all measuring something other than what you mean.

The signs of the crisis:

  • Benchmark saturation. Top frontier models score above 90% on most established benchmarks, leaving no room to differentiate.
  • Contamination. Existing benchmarks are silently broken by training-data inclusion.
  • Goodhart’s law. Once a metric is targeted for optimization, it stops being a good measure. Every benchmark eventually becomes one of these.
  • The “vibes are better” problem. Users sometimes prefer model A over model B for reasons that no metric captures. The arena catches some of this; nothing catches all of it.

The trends to watch:

  • Continuously refreshed benchmarks. Benchmarks where the questions change weekly to defeat contamination. LiveCodeBench is the leading example.
  • Capability-specific benchmarks. Instead of one number, a profile across many capability dimensions (reasoning, code, math, multilingual, long context, instruction following).
  • Process supervision. Evaluate not just the final answer but the reasoning chain that produced it. More signal, harder to score.
  • Verifier-based eval. For tasks with formal verifiers (code execution, math proofs), the verifier is the eval and it’s contamination-immune. Reasoning models lean on this heavily.
  • User-preference-as-eval. Continuous A/B testing in production with user votes as the ground truth. Hard to set up, gold standard once you have it.

The honest answer to “how do I evaluate LLMs” in 2025 is: a portfolio of imperfect signals, weighted toward task-specific golden sets and arena scores, with constant skepticism about benchmark numbers. There is no single number you should trust. The senior move is to know which signals to trust under which conditions.

20.11 Part II capstone

This is the last chapter of Part II. You now have the full picture of how models are produced:

  • Chapter 11: pretraining at scale — data, compute, Chinchilla, the cost of frontier training.
  • Chapter 12: distributed training — DDP, ZeRO, FSDP, TP, PP, SP, 3D parallelism.
  • Chapter 13: mixed precision — fp16, bf16, fp8, loss scaling, master weights.
  • Chapter 14: tokenizer training — BPE algorithm, multilingual sampling, the permanence of the tokenizer.
  • Chapter 15: fine-tuning — LoRA, QLoRA, the PEFT family, multi-tenant adapter serving.
  • Chapter 16: SFT — turning a base into an assistant, loss masking, chat templates, catastrophic forgetting.
  • Chapter 17: alignment — RLHF, DPO, KTO, CAI, the failure modes.
  • Chapter 18: compression — distillation, dark knowledge, pruning, the lottery ticket.
  • Chapter 19: synthetic data — the data cliff, Magpie, rejection sampling, model collapse.
  • Chapter 20 (this chapter): evaluation — the eval crisis, benchmarks, LLM-as-judge, golden sets, the arena.

Together, Parts I and II are the foundation. You should be able to:

  • Defend Chinchilla scaling and explain why modern practice is “over-Chinchilla.”
  • Estimate distributed training memory and pick a parallelism strategy for any model size.
  • Diagnose mixed-precision bugs.
  • Pick a fine-tuning method for a budget and explain the LoRA serving pattern.
  • Design an SFT and DPO pipeline.
  • Recognize the synthetic-data shift and its risks.
  • Design an internal eval suite that catches production regressions.

In Part III we shift from training to inference: how the models you now know how to build are actually served at scale, starting with the operation that matters most — the KV cache and PagedAttention.

20.12 The mental model

Eight points to take into Part III:

  1. There is no single LLM evaluation metric. Use a portfolio.
  2. Loss < accuracy < end-to-end quality in priority. End-to-end always wins.
  3. Standard benchmarks (MMLU, HumanEval, etc.) are contaminated. Treat their scores as upper bounds, not true measurements.
  4. lm-eval-harness is the standard framework. Even so, prompt formatting can swing scores by several points.
  5. LLM-as-judge has known biases (position, length, self-preference, style). Use multiple judges and position-randomized comparison.
  6. Golden sets are the most actionable signal for production work. Build one.
  7. The LMSYS Arena is the contamination-resistant gold standard for chat model quality.
  8. The eval crisis is real. Be skeptical of every benchmark number, including your own.

In Chapter 21 we open Part III with the prefill/decode asymmetry, which we previewed in Chapter 8 and which is the spine of the entire inference internals stack.


Read it yourself

  • Hendrycks et al., Measuring Massive Multitask Language Understanding (2021) — the MMLU paper. Read for what it claims and what we know now about its limits.
  • Chen et al., Evaluating Large Language Models Trained on Code (HumanEval, 2021).
  • Rein et al., GPQA: A Graduate-Level Google-Proof Q&A Benchmark (2023).
  • Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (2023). The MT-Bench paper. Also describes the early arena.
  • Chiang et al., Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (2024). The full arena writeup.
  • Sainz et al., NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for Each Benchmark (2023). The contamination critique.
  • The lm-evaluation-harness GitHub README and documentation.
  • The LMSYS Arena leaderboard at lmarena.ai.

Practice

  1. Pick a model from the LMSYS Arena leaderboard and check its scores on MMLU, GPQA, HumanEval, and Arena. Compare. Do they tell a consistent story?
  2. Run lm-evaluation-harness on a small open model with both 0-shot and 5-shot MMLU. Compare the scores. Why are they different?
  3. Build a golden set of 20 prompts for a use case you care about. Run two open models on it and compare the outputs side-by-side. Which do you prefer? Can you articulate why?
  4. Why is LLM-as-judge biased toward longer responses? Construct a thought experiment and verify on a tiny example.
  5. The LMSYS Arena uses Elo ratings. Why Elo and not raw win rate? What does Elo capture that win rate doesn’t?
  6. Read the GPQA paper. The questions were verified by domain experts; explain why this matters for contamination resistance, and what the limit of that resistance is.
  7. Stretch: Implement an LLM-as-judge evaluation pipeline that compares two models on a prompt set, using GPT-4 (or another strong model) as the judge, with position randomization and length-controlled scoring. Run it on two open models and report results.

Concept check

4 questions.
  1. What is benchmark contamination and why does it make published benchmark scores upper bounds rather than true capability estimates?
  2. An LLM-as-judge pipeline rates model A higher than model B. What well-documented bias should make you skeptical of this result?
  3. Why is the LMSYS Chatbot Arena considered more contamination-resistant than static benchmarks like MMLU?
  4. A team evaluates their new fine-tuned model on internal task-specific golden sets and sees improvement, but also runs MMLU and sees regression. How should they interpret this?