Part III · Inference Internals & Production Serving
Chapter 42 ~17 min read

Test-time compute and reasoning models: o1, R1, MCTS-decoding

"You can buy quality with more inference compute, just like you can buy quality with more training compute. The latter has been clear since 2020. The former is the story of 2024-25."

The dominant story in LLMs from 2020-2023 was scale. Bigger models with more training data give better quality. The Chinchilla scaling laws (Chapter 11) formalized this.

The dominant story of 2024-25 has been test-time compute. Instead of (or in addition to) making the model bigger, you let it think longer at inference time. Generate multiple candidate answers, sample many reasoning chains, search through possibilities, select the best. The result is models that are dramatically better at reasoning tasks at the cost of much higher inference compute per query.

OpenAI’s o1 (released September 2024) was the first widely-known model in this category. DeepSeek-R1 (January 2025) was the first open-source replication. The technique has spread quickly and is reshaping how the field thinks about model quality.

This chapter is about the architecture and inference picture of reasoning models. By the end you will understand:

  • The basic idea: trade inference compute for quality.
  • How chain-of-thought and self-consistency relate to reasoning models.
  • The o1 paradigm and the R1 replication.
  • Why reasoning models are dramatically more expensive to serve.
  • What this means for the inference stack.

Outline:

  1. The test-time compute idea.
  2. Chain-of-thought as a precursor.
  3. Self-consistency and best-of-N sampling.
  4. Tree search and MCTS-decoding.
  5. The o1 paradigm.
  6. DeepSeek-R1 and the open replication.
  7. Inference cost: why reasoning models are 10-100× more expensive.
  8. Serving implications.
  9. The state of reasoning in 2025.

42.1 The test-time compute idea

The Chinchilla scaling laws say: better model quality requires more training compute, with parameters and training tokens scaled in proportion. Doubling training compute gives a predictable improvement in loss and, downstream, in benchmark scores.

The new observation: you can also buy quality with more inference compute. For a fixed model, sampling more responses, doing more steps of reasoning, or searching through more possibilities at inference time consistently improves accuracy on reasoning tasks. The improvement isn’t free — it costs compute proportional to how much extra you spend — but it’s real and exploitable.
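A toy model makes the claim concrete. Under two idealizations (samples are independent with per-sample accuracy p, and a perfect verifier recognizes a correct answer when one appears), best-of-K accuracy follows a simple formula:

```python
# Idealized best-of-K accuracy: each of K independent samples is correct
# with probability p, and a perfect verifier spots any correct answer.
def best_of_k_accuracy(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

for k in (1, 4, 16, 64):
    print(f"K={k:>2}: {best_of_k_accuracy(0.3, k):.3f}")
```

With p = 0.3 this gives 0.30 at K=1, 0.76 at K=4, and ≈1.0 at K=64. Real gains saturate sooner, because samples from one model are correlated and verifiers are imperfect, but the shape of the curve is why spending more inference compute buys accuracy.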

The two regimes:

Train-time compute scaling. Bigger models, more data, longer training runs. Cost is paid once during training. Inference is comparatively cheap. This was the dominant approach 2020-2023.

Test-time compute scaling. Same model, more compute per query. Cost is paid per request. Inference is much more expensive but the model itself is the same size. This is the new approach starting in 2024.

Crucially, the two are complementary. A bigger model with more test-time compute is better than either alone. The frontier is in combining them.

Test-time compute is most effective for reasoning tasks — math, code, multi-step logic — where the model can verify its own outputs and improve. It’s less effective for tasks where there’s no clear way to “think harder” (open-ended writing, simple factual questions).

42.2 Chain-of-thought as a precursor

The first hint of test-time compute scaling was chain-of-thought (CoT) prompting (Wei et al., 2022). The observation: if you prompt an LLM to “think step by step” before giving its final answer, the answer is more often correct on reasoning tasks.

The mechanism: by writing out intermediate reasoning, the model effectively spends more inference compute on the problem. Each intermediate step is a token that the model has to predict, and predicting it costs compute. The model uses that compute to reason about the problem.

CoT is a prompting technique, not an architecture. It works on any LLM that can produce coherent text. The improvements on reasoning benchmarks are large — sometimes 10-30 percentage points on math problems.

The downside: CoT only works if the model already knows how to reason. If the underlying model is weak at math, prompting it to “think step by step” doesn’t help much. The model has to be capable of producing useful reasoning steps in the first place.

CoT was the first sign that test-time compute matters. The reasoning models of 2024-25 take this much further.

42.3 Self-consistency and best-of-N

The next step beyond CoT: sample multiple reasoning chains and pick the best one.

Self-consistency (Wang et al., 2022): for a math problem, generate K different reasoning chains with sampling (non-zero temperature), extract the final answer from each, and vote — pick the answer that appears most often. Empirically, this improves accuracy because correct answers are more likely to be reached via multiple distinct reasoning paths.

Best-of-N sampling: similar idea but with a different aggregation. Instead of voting, use a verifier (another model, or a checker like a Python interpreter) to score each candidate and pick the best.

Both techniques exploit the fact that an LLM can sometimes produce a correct answer and sometimes not. Sampling many attempts and picking the best is much more reliable than picking the first attempt. The cost is K× the inference compute.
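In code, self-consistency is a few lines around the model call. A minimal sketch; `sample_fn` here stands in for a temperature > 0 model call that returns the extracted final answer from one chain:

```python
from collections import Counter

def self_consistency(sample_fn, prompt: str, k: int) -> str:
    """Sample k chains, extract each final answer, return the majority vote."""
    answers = [sample_fn(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a sampled model: right 6 times out of 8.
fake_answers = iter(["42", "41", "42", "42", "41", "42", "42", "42"])
winner = self_consistency(lambda p: next(fake_answers), "what is 6*7?", k=8)
print(winner)  # "42" wins the vote 6-2
```

Swapping the `Counter` vote for a verifier that scores each answer and returns the argmax turns this into best-of-N.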

[Figure: best-of-N sampling. K chains are generated independently from the same prompt (chain 1: 42, chain 2: 42, …, chain K: 37, wrong); a verifier, via vote or score, picks the best answer: 42 by majority.]
Best-of-N samples K independent chains; the verifier picks the answer with the most votes (self-consistency) or the highest score — correct answers are more likely to appear multiple times than wrong ones.

For math benchmarks, self-consistency with K=64 samples can boost accuracy by 5-15 points compared to single-sample CoT. The cost is 64× the inference compute per query, which is significant.

These techniques are still pure prompting / sampling — no model changes. But they pushed the boundary of how much compute you could throw at a query and get returns.

42.4 Tree search and MCTS-decoding

The next conceptual step: search, not just sample. Instead of independently sampling K reasoning chains, explore a tree of partial reasoning chains, expanding the most promising ones, pruning the bad ones.

The idea is borrowed from game-playing AI (AlphaGo, AlphaZero) where Monte Carlo Tree Search (MCTS) is the dominant technique. For language reasoning, you replace “game state” with “partial reasoning chain” and “move” with “next reasoning step.”

The setup:

  1. Start with the prompt as the root of the tree.
  2. Expand: pick a leaf node and generate K candidate next reasoning steps.
  3. Evaluate: score each candidate (with a learned value function or a verifier).
  4. Select: pick the best candidate based on the score.
  5. Repeat until you reach a final answer.

The key technical piece is the value function that scores partial reasoning chains. You can train a small neural network to predict “is this partial chain on track to be correct?” — this is exactly what AlphaGo does for board positions. The value function is trained on labeled examples of correct/incorrect reasoning chains.

[Figure: MCTS decoding tree. Nodes are partial reasoning chains, edges are next-step candidates, and value scores v ∈ [0,1] guide which branches to expand; low-value branches are pruned early, saving compute vs. naive best-of-N.]
MCTS-decoding concentrates compute on high-value branches (dark border), pruning low-value paths early — reaching the correct answer with far fewer total tokens than exhaustive best-of-N sampling.

MCTS-decoding can use much less total compute than naive best-of-N because it focuses compute on the promising branches. But it’s more complex to implement and requires the value function.
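The search loop can be sketched compactly. This is a simplified cousin of MCTS (best-first expansion with no rollouts or value backup); the three callables are assumptions of the sketch, not a real API:

```python
import heapq

def value_guided_search(prompt, expand_fn, value_fn, is_final, max_expansions=50):
    """Best-first search over partial reasoning chains, guided by a value
    function -- a simplified cousin of MCTS-decoding (no rollouts, no
    value backup). The callables are assumptions of this sketch:
      expand_fn(chain) -> candidate next steps (strings)
      value_fn(chain)  -> score in [0, 1], higher = more promising
      is_final(chain)  -> True once the chain contains a final answer
    """
    frontier = [(-value_fn(prompt), prompt)]
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, chain = heapq.heappop(frontier)      # most promising node so far
        if is_final(chain):
            return chain
        for step in expand_fn(chain):           # expand, score, push children
            child = chain + "\n" + step
            heapq.heappush(frontier, (-value_fn(child), child))
    return None

# Toy problem: two good steps reach the answer; "bad" branches score 0.
toy_steps = {0: ["good step", "bad step"], 1: ["final answer", "bad step"]}
expand = lambda c: toy_steps.get(c.count("\n"), [])
value = lambda c: 0.0 if "bad" in c else 1.0
result = value_guided_search("Prompt", expand, value, lambda c: "final answer" in c)
print(result.splitlines())  # ['Prompt', 'good step', 'final answer']
```

Full MCTS adds stochastic rollouts and backs values up the tree; the skeleton above already captures why search beats blind sampling: compute goes only to chains the value function likes.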

Several research papers in 2023-24 explored MCTS-decoding for reasoning. The o1 paradigm builds on these ideas (though OpenAI hasn’t published the exact details).

42.5 The o1 paradigm

OpenAI’s o1 (September 2024) defined this category of model. Some details from OpenAI’s blog posts and system card:

  • The model is trained with reinforcement learning to do long chains of thought before answering.
  • The reasoning process involves many internal “thinking” steps that aren’t shown to the user.
  • The total tokens generated per query can be 10-100× more than a normal LLM, because of the long reasoning chains.
  • The accuracy on math, code, and reasoning benchmarks is dramatically higher than non-reasoning models.

The key shift: o1’s reasoning is part of the model, not an external scaffold. The model has been trained (with RL on reasoning tasks) to generate long internal chains of thought as part of its generation. You can’t get o1’s behavior by prompting a normal LLM with “think harder” — the model itself is different.

Specifically, the training process for o1-style models (as best as the public understands it):

  1. Start with a strong base model.
  2. Generate many reasoning chains for math/code problems with the base model.
  3. Use a process reward model that scores not just the final answer but the intermediate steps.
  4. Train the model with RL (PPO or similar) to produce reasoning chains that score well according to the reward model.
  5. The result is a model that has learned to reason at length on hard problems.

The training is harder than standard RLHF (Chapter 17) because the reward signal is more sparse and the rollouts are longer. OpenAI hasn’t published the exact recipe, but the research community has reverse-engineered the rough shape.

The user-facing experience: o1 takes longer to respond (because of the internal reasoning) but produces dramatically better answers on hard problems. On AIME 2024 (a math competition), o1 scores 83%, compared to ~13% for GPT-4o: a roughly 6× accuracy gain, bought by spending far more compute at inference time (plus the RL training that makes that compute productive).

42.6 DeepSeek-R1 and the open replication

In January 2025, DeepSeek released DeepSeek-R1, the first open-source model that replicates the o1 paradigm. The technical report described the training recipe in detail (more openly than OpenAI did for o1):

  • Start with DeepSeek-V3 (the 671B MoE base, Chapter 34).
  • Apply GRPO (Group Relative Policy Optimization), a variant of PPO designed for RL on reasoning tasks.
  • Train on math and code problems with rule-based rewards (the answer is checkable: math has a known answer, code can be tested).
  • The model learns to produce long reasoning chains as part of its responses.
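The reward side of this recipe is strikingly simple compared to learned reward models. A sketch in the spirit of R1’s rule-based rewards; the extraction regex and function shape here are illustrative, not the paper’s:

```python
import re

def math_reward(response: str, gold_answer: str) -> float:
    """Rule-based reward in the spirit of DeepSeek-R1: 1.0 if the extracted
    final answer matches the known answer, else 0.0. (R1 also uses a format
    reward for its thinking tags; this extraction rule is illustrative.)"""
    m = re.search(r"answer is\s*([-\d.,/]+)", response.lower())
    if not m:
        return 0.0  # no recognizable final answer
    return 1.0 if m.group(1).strip(".,") == gold_answer else 0.0

print(math_reward("Therefore, the answer is 4.", "4"))  # 1.0
print(math_reward("Hmm, the answer is 5.", "4"))        # 0.0
```

Because the reward is checkable by rule, there is no reward model to hack; the policy only scores when it actually solves the problem.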

The result: R1 matches or exceeds o1 on many reasoning benchmarks, and the weights are openly released. This was a watershed moment for the field — the first public, reproducible recipe for reasoning models at frontier scale.

DeepSeek also released R1-Distill variants — smaller models distilled from R1’s reasoning chains. These show that the reasoning capability can be transferred from a large model to a smaller one via supervised distillation, dramatically reducing the compute cost of serving reasoning models.
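The data-collection half of that distillation pipeline is easy to sketch; all names here (`teacher_generate`, `checker`) are hypothetical stand-ins, with the real recipe in the R1 report:

```python
def build_distill_dataset(teacher_generate, problems, checker, n_samples=4):
    """Collect SFT data for distillation: sample reasoning traces from the
    teacher, keep one verified-correct trace per problem. The student is
    then fine-tuned on these (prompt, completion) pairs with ordinary
    supervised learning -- no RL needed on the student side."""
    dataset = []
    for prob in problems:
        for _ in range(n_samples):
            trace = teacher_generate(prob)   # full chain of thought + answer
            if checker(prob, trace):         # rule-based correctness check
                dataset.append({"prompt": prob, "completion": trace})
                break                        # one good trace per problem
    return dataset

# Toy teacher that gets 2+2 right on the first sample:
ds = build_distill_dataset(
    lambda p: "think... the answer is 4",
    ["what is 2+2?"],
    lambda p, t: "4" in t,
)
print(len(ds))  # 1
```

The student never sees the RL machinery; it simply imitates verified reasoning traces, which is why distillation transfers the capability so cheaply.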

The R1 release accelerated the field. Many open-source reasoning model variants appeared in early 2025 (QwQ from Qwen, several community fine-tunes). The recipe is public and being iterated rapidly.

42.7 Inference cost — why reasoning models are 10-100× more expensive

The headline operational fact about reasoning models: they generate enormous numbers of tokens per response.

A normal LLM responds to “what is 2+2?” with “4” — one token. A reasoning model responds with a multi-paragraph reasoning chain followed by the answer:

Let me think about this. The question asks what 2+2 equals.

I need to add the numbers 2 and 2. Adding two of something to two of the same thing
gives me four of that thing.

Therefore, 2+2 = 4.

The answer is 4.

That’s ~60 tokens for a question that should take 1. For harder problems, the reasoning chain can be thousands of tokens.

The empirical numbers from o1 and R1:

  • Average response length on a hard math problem: 5,000-30,000 tokens of reasoning.
  • Compared to a normal LLM: ~500 tokens.
  • The cost ratio: 10-60× more decode tokens per query.
[Figure: token output comparison. A normal LLM emits ~500 tokens per query; a reasoning model averages ~15,000 on hard math (range 5,000-30,000). Per-token cost is identical; per-query cost is 10-60× higher.]
Reasoning models are not more expensive per token — they are expensive because they emit 10–60× more tokens per query; the entire cost difference comes from the longer decode phase.

Decode tokens are the dominant cost in serving (Chapter 30). A reasoning model that emits 10× the tokens costs roughly 10× more to serve per query.

This is the new cost regime: reasoning models are not “more expensive per token” but “more tokens per query.” Per-token costs are unchanged; per-query costs are 10-100× higher, and the total cost shift is dramatic.
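The per-query arithmetic, at an assumed flat $0.50 per million output tokens (illustrative pricing, not any provider’s):

```python
price_per_token = 0.50 / 1_000_000   # assumed $0.50 per M output tokens

normal_tokens    = 500               # typical chat response
reasoning_tokens = 15_000            # mid-range reasoning chain (hard math)

normal_cost    = normal_tokens * price_per_token      # $0.00025 / query
reasoning_cost = reasoning_tokens * price_per_token   # $0.00750 / query
print(f"{reasoning_cost / normal_cost:.0f}x per-query cost")  # 30x
```

Fractions of a cent either way, but at millions of queries per day the 30× multiplier is the difference between a rounding error and a dominant line item.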

For OpenAI, o1 is priced at roughly 30-60× the per-query cost of GPT-4o for the equivalent capability. This reflects both the compute cost (more tokens) and the value (dramatically better quality).

42.8 Serving implications

Reasoning models change every layer of the serving stack:

(1) Decode dominates. The prefill is unchanged, but the decode is 10-100× longer. The serving stack is even more decode-bound than for normal LLMs.

(2) KV cache pressure scales. Each in-flight request now has 10-30k tokens of cache instead of 1-2k. The KV cache memory pressure is severe.

(3) Latency expectations change. Users expect a long pause before the answer. Streaming the reasoning chain (or hiding it) is part of the UX. Time-to-first-token matters less; total response time matters more.

(4) Continuous batching is even more important. With long decode tails, continuous batching is essential to keep the GPU busy.

(5) Quality vs cost trade-off shifts. Smaller reasoning models (R1-Distill 7B, 14B, 32B) can replicate the reasoning behavior at much lower cost. The distillation approach is critical for production deployment.

(6) Generation length caps need rethinking. A standard generation cap of 2000 tokens cuts off reasoning models mid-thought. Reasoning models need higher limits (16k+ tokens).

(7) Speculative decoding is more valuable. Long generations benefit from speculative decoding’s per-token speedup (Chapter 27). The benefits compound.

(8) Cost modeling is different. Per-million-tokens cost is the same; per-query cost is 10-100×. Customers paying for reasoning models need to understand this — their costs will be much higher than for normal LLMs.
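Point (2) is worth quantifying. A sketch with illustrative Llama-70B-like shapes (80 layers, 8 KV heads under GQA, head dim 128, fp16); these numbers are assumptions, not any particular model’s config:

```python
def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Per-request KV-cache size: 2 tensors (K and V) per layer, each
    holding kv_heads * head_dim values per token at dtype_bytes each."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

for t in (2_000, 30_000):
    print(f"{t:>6} tokens -> {kv_cache_bytes(t) / 1e9:.2f} GB")
```

Under these assumptions each token holds ~0.33 MB of cache, so a single 30k-token reasoning request carries ~9.8 GB: as much as fifteen 2k-token chat requests. That is the KV-cache pressure in concrete terms.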

The serving stacks (vLLM, SGLang) have added features to support reasoning models: longer generation lengths, optimized handling of long decode tails, streaming for the reasoning portion vs the answer portion, etc.

42.9 The state of reasoning models in 2025

Where the field is in late 2025:

  • OpenAI o1, o1-mini, o3: the original line. Best quality on hard reasoning, very expensive.
  • DeepSeek-R1 and variants: open-source frontier. Comparable to o1 on many benchmarks.
  • R1-Distill (Llama 8B, Qwen 7B, etc.): distilled smaller models. Surprisingly capable, much cheaper to serve.
  • QwQ from Alibaba: another open reasoning model in the R1 lineage.
  • Claude with extended thinking (Anthropic): reasoning capability added via alignment training.
  • Gemini 2 with thinking (Google): similar.
  • Various community fine-tunes of Llama and Qwen using R1-style training.

The pattern: reasoning is now a feature category, not a single model. Most frontier model providers have a reasoning variant. The trade-off — much higher cost per query for much better quality on hard tasks — is being explicitly priced and offered.

The next directions:

  • Smaller reasoning models that don’t require frontier-scale compute. R1-Distill is the start.
  • Reasoning for non-math tasks: applying the technique to coding, agents, search, planning.
  • Better verifiers to provide reward signal for tasks where the answer isn’t checkable by a rule.
  • Combined train-time and test-time scaling: bigger models that also do test-time compute.

The field is moving fast. Expect reasoning models to become as standard as instruction-tuned models over the next 1-2 years. Production deployments will need to handle them.

42.10 The mental model

Eight points to take into Chapter 43:

  1. Test-time compute is a new scaling axis. Quality improves with more inference compute per query, not just with more training compute.
  2. Chain-of-thought was the precursor — prompting a model to think step by step.
  3. Self-consistency and best-of-N sample many candidates and pick the best. K× compute, often substantial improvement.
  4. MCTS-decoding searches through reasoning trees. More efficient than sampling but more complex.
  5. The o1 paradigm trains the model itself (with RL) to produce long internal reasoning chains.
  6. DeepSeek-R1 is the first open replication of o1. Recipe is public.
  7. Reasoning models cost 10-100× more per query because they emit 10-100× more tokens. Per-token cost is unchanged.
  8. Serving stacks need to handle long decode tails, larger KV caches, and higher generation length limits.

In Chapter 43 we close out Stage 3 (research frontier) with structured generation — the technique that makes models reliably produce JSON, regex-matching output, and other constrained formats.


Read it yourself

  • Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022).
  • Wang et al., Self-Consistency Improves Chain of Thought Reasoning in Language Models (2022).
  • The OpenAI o1 system card and blog post (September 2024).
  • The DeepSeek-R1 technical report (January 2025). The most detailed public account of how to train a reasoning model.
  • Snell et al., Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters (2024). The empirical scaling-law paper for test-time compute.
  • The R1-Distill model cards on HuggingFace.

Practice

  1. Compute the per-query cost difference between a normal LLM (500 token average response) and a reasoning model (15000 token average response) at $0.50/M tokens.
  2. Why does test-time compute help reasoning tasks more than open-ended writing tasks? Argue at the level of “what does the extra compute buy.”
  3. Why does self-consistency with K=64 samples improve math accuracy by 10+ points over single-sample CoT? What’s the failure mode of single-sample?
  4. Read the DeepSeek-R1 paper. Identify the difference between R1-Zero (pure RL) and R1 (RL + SFT). Why both?
  5. How does R1-Distill capture R1’s reasoning capability in a smaller model? Trace the distillation pipeline.
  6. What changes for the serving stack when most requests are reasoning queries? List five operational impacts.
  7. Stretch: Run R1-Distill 7B and a standard Llama 7B on the same set of math problems. Compare accuracy, response length, and total compute time.