Part II · Training, Fine-Tuning, Alignment
Chapter 11 · Core · ~24 min read

Pretraining at scale: data, compute, curriculum

"The bitter lesson is that scale wins, and the bitter lesson about the bitter lesson is that scale is hard."

This is the first chapter of Part II, and it’s the one most people skim. They shouldn’t. Even if you will never pretrain a frontier model — and you probably won’t — you should understand what pretraining is, what it costs, and why every later choice in the pipeline (fine-tuning, alignment, evaluation, serving) is downstream of it. Inference choices are downstream of training choices. Production behavior is downstream of training data. Cost is downstream of FLOPs. You cannot reason about LLMs in production without understanding what produced them.

This chapter is a tour of pretraining from the outside. By the end you will be able to:

  • Explain the Chinchilla scaling law and predict the data-vs-parameter tradeoff for a target compute budget.
  • Estimate the cost of training a frontier-class model in dollars and weeks.
  • List the data sources of a modern open LLM and predict their relative weights.
  • Explain what data deduplication and quality filtering actually do, and why they’re 80% of pretraining quality.
  • Defend the claim “data is the moat” in an interview.

Outline:

  1. The headline numbers — what frontier pretraining looks like in late 2025.
  2. The cost equation: parameters × tokens × FLOPs.
  3. Chinchilla scaling laws — the optimal training-compute frontier.
  4. Data sources: web, books, code, math, papers.
  5. Deduplication — the most important data step nobody talks about.
  6. Quality filtering — perplexity filters, classifier filters, the “did this come from a textbook” signal.
  7. Tokenization for pretraining — how it ties into Chapter 5.
  8. The training loop: epochs, checkpoints, restarts.
  9. Curriculum learning: does it work?
  10. Monitoring a pretraining run.
  11. The cost: dollars, GPUs, weeks, people.
  12. Why most companies don’t pretrain (and why a few should).

11.1 The headline numbers

To set the scale, here are the rough public numbers for some frontier-class LLMs as of late 2025. Take these as orders of magnitude, not exact figures:

| Model | Parameters | Training tokens | Compute (FLOPs) | Estimated cost |
| --- | --- | --- | --- | --- |
| GPT-3 (2020) | 175B | 300B | ~3 × 10²³ | ~$5M |
| Chinchilla (2022) | 70B | 1.4T | ~6 × 10²³ | ~$10M |
| Llama 2 (2023) | 70B | 2.0T | ~10²⁴ | ~$20M |
| Llama 3 (2024) | 70B | 15T | ~6 × 10²⁴ | ~$50M |
| Llama 3.1 (2024) | 405B | 15T | ~4 × 10²⁵ | ~$200M+ |
| DeepSeek-V3 (2024) | 671B (MoE, 37B active) | 14.8T | ~3 × 10²⁴ | ~$6M (publicly claimed) |
| GPT-4 (rumored) | ~1.8T (rumored MoE) | ? | ~10²⁵ | ~$100M+ |

Two patterns to notice. First, training data has grown faster than parameter count. The 70B parameter point of 2022 was trained on 1.4T tokens; by 2024, the same parameter point was trained on 15T tokens — a 10× increase. This is the Chinchilla scaling story (§11.3) playing out in production: it turns out you should spend much more compute on data than the GPT-3 generation did.

[Figure: bar chart, relative scale (GPT-3 = 1) of parameters vs training tokens for GPT-3 (2020), Chinchilla (2022), Llama 2 (2023), and Llama 3 (2024).]
Training token counts have grown ~10× faster than parameter counts since GPT-3 — the Chinchilla insight playing out in production, driven by the discovery that inference cost dominates lifetime model cost.

Second, costs have grown but not as fast as the numbers above suggest, because GPU-hours per FLOP have fallen dramatically. An H100 in 2024 delivers roughly 5× the practical FLOPs/sec of an A100 in 2020. The DeepSeek-V3 number — $6M for a 671B MoE — is the most striking recent data point: a frontier-class model trained for one-tenth the cost of Llama 3.1 405B, by being clever about MoE, mixed-precision training (fp8 native), and pipeline scheduling.

The takeaway: pretraining is expensive but not getting linearly more expensive. The improvements come from better algorithms, better hardware utilization, and better data curation — not just from throwing more GPUs at the problem.

11.2 The cost equation

The single most important formula in pretraining cost estimation:

Training compute (FLOPs) ≈ 6 × parameters × tokens

This is the Chinchilla equation for the FLOPs cost of one full pass over the training data with a dense transformer. The factor of 6 comes from breaking down each training step:

  • The forward pass costs ≈ 2 × parameters × tokens FLOPs. (Each parameter participates in roughly two FLOPs per token: a multiply and an add.)
  • The backward pass costs ≈ 4 × parameters × tokens FLOPs. (Backward is roughly 2× the forward, because for each weight you compute both the gradient w.r.t. the weight and the gradient w.r.t. the input.)
  • Total: 6 × parameters × tokens.

For a 70B model trained on 15T tokens:

6 × 70 × 10⁹ × 15 × 10¹² ≈ 6.3 × 10²⁴ FLOPs

Now divide by your hardware throughput. An H100 with reasonable model FLOP utilization (MFU) of ~40% delivers:

989 TFLOP/s (peak bf16) × 0.4 ≈ 400 TFLOP/s sustained

So the H100-hours required:

6.3 × 10²⁴ / (400 × 10¹²) / 3600 ≈ 4.4 × 10⁶ H100-hours

That’s 4.4 million H100-hours. On a 1024-H100 cluster, that’s roughly 4300 hours of wall-clock, or about 180 days. On a 16,000-H100 cluster (which Meta has) it’s about 11 days.

At an effective cost of $2/H100-hour (the cloud rate is higher; the amortized cost at hyperscaler scale is lower), 4.4M H100-hours is ~$9M of pure compute, before the engineering team, the data preparation pipeline, the eval cluster, the failed runs, and the iteration cost.
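The back-of-envelope above is worth scripting once so it can be re-run for any model. A minimal sketch; the MFU, peak-FLOPs, and $/GPU-hour figures are the illustrative assumptions from this section, not vendor-guaranteed numbers:

```python
# Back-of-envelope pretraining cost from the 6 * P * D formula.
# peak_flops/mfu/dollar rates are illustrative assumptions.

def pretraining_cost(params, tokens, peak_flops=989e12, mfu=0.40,
                     n_gpus=1024, dollars_per_gpu_hour=2.0):
    flops = 6 * params * tokens                   # total training FLOPs
    sustained = peak_flops * mfu                  # FLOP/s actually achieved per GPU
    gpu_hours = flops / sustained / 3600          # total GPU-hours
    wall_days = gpu_hours / n_gpus / 24           # wall-clock on the cluster
    cost = gpu_hours * dollars_per_gpu_hour
    return gpu_hours, wall_days, cost

# The 70B-on-15T example from the text:
gpu_hours, wall_days, cost = pretraining_cost(70e9, 15e12)
print(f"{gpu_hours:.2e} H100-hours, {wall_days:.0f} days on 1024 GPUs, ${cost/1e6:.1f}M")
```

Plugging in 405B parameters or a different cluster size reproduces the other rows of the headline table to within rounding.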

Memorize the 6PD formula: FLOPs ≈ 6 × P (parameters) × D (tokens). It is the most-asked back-of-envelope question in ML systems interviews.

11.3 Chinchilla scaling laws

Before 2022, the consensus was “bigger model = better.” GPT-3 was 175B parameters, and it was assumed that the next generation would be even bigger. Then DeepMind’s Chinchilla paper (Hoffmann et al., 2022) measured what actually happens when you fix a compute budget and ask: what’s the optimal split between model size and training data?

Their answer, after running hundreds of experiments at different scales: for a given compute budget, the optimal model size and the optimal training-token count grow at roughly equal rates — both scale as compute^0.5. In other words, doubling your compute should mean doubling both the model size and the data size, not quadrupling the model and keeping the data the same.

[Figure: the Chinchilla compute-optimal frontier (D ≈ 20N) in parameter–token space, with GPT-3 sitting below it (undertrained) and Llama 3 70B trained 10× past it for inference-cost optimization.]
GPT-3 sat far below the Chinchilla frontier (too many parameters, too few tokens). Modern practice deliberately trains past the frontier because a smaller, overtrained model is cheaper to serve at scale.

Concretely, the Chinchilla rule of thumb is:

optimal_tokens ≈ 20 × parameters

So a 70B model “wants” about 1.4T tokens of training data to reach its compute-optimal point. A 7B model wants about 140B tokens. A 405B model wants about 8.1T tokens.
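The rule of thumb pins down the compute-optimal split for any budget: substituting D = 20N into compute C = 6ND gives C = 120N². A sketch:

```python
import math

# Chinchilla-optimal split of a compute budget C (FLOPs), under the
# rule of thumb above: C = 6*N*D with D = 20*N, so N = sqrt(C / 120).

def chinchilla_optimal(compute_flops):
    n = math.sqrt(compute_flops / 120)   # optimal parameter count
    d = 20 * n                           # optimal training-token count
    return n, d

# Chinchilla's own budget (~6e23 FLOPs) should land near 70B / 1.4T:
n, d = chinchilla_optimal(6e23)
print(f"N ≈ {n/1e9:.0f}B parameters, D ≈ {d/1e12:.1f}T tokens")  # N ≈ 71B, D ≈ 1.4T
```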

This was an enormous result. It implied that GPT-3 (175B parameters, 300B tokens) was massively undertrained — at GPT-3’s compute budget, you should have trained a 70B model on 1.4T tokens instead. Chinchilla itself was that 70B-on-1.4T model, and it outperformed GPT-3 across most benchmarks at one-third the parameter count.

The Chinchilla finding rewrote the playbook. Llama 2 (70B, 2T tokens) was already over-Chinchilla. Llama 3 (70B, 15T tokens) is 10× over Chinchilla — it’s trained on more than ten times the data the scaling law predicts is “optimal” for its size. Why?

Because the Chinchilla optimum is the optimum for training compute alone. It assumes training is the only thing you care about. But once a model is trained, it gets served — and inference compute is much, much larger than training compute over the model’s lifetime. A frontier model that gets queried billions of times at inference will spend orders of magnitude more compute on inference than it did on training. Under this lens, you should accept a higher training cost in exchange for a smaller, faster model that’s cheaper at inference.

So the modern playbook is: train smaller models on more data than Chinchilla suggests, because the inference cost dominates the lifetime cost. Llama 3’s 15T-token training of a 70B model is the canonical example.
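The tradeoff is easy to quantify. A sketch of the break-even point; the 100B-tokens-per-day serving volume is an illustrative assumption, not a published figure:

```python
# When does cumulative inference compute overtake training compute?
# served_tokens_per_day is an illustrative assumption.

def breakeven_days(params, train_tokens, served_tokens_per_day):
    train_flops = 6 * params * train_tokens                    # forward + backward
    serve_flops_per_day = 2 * params * served_tokens_per_day   # forward pass only
    return train_flops / serve_flops_per_day

days = breakeven_days(params=70e9, train_tokens=15e12,
                      served_tokens_per_day=100e9)
print(f"inference compute passes training compute after ~{days:.0f} days")  # ~450
```

Past that point, every extra day of serving favors the smaller model: a 405B dense model costs roughly 6× more FLOPs per served token than a 70B one, for the model's entire service life.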

This is one of the cleanest “the math told us X, then production told us Y” stories in ML. Understand it. It’s a common interview question.

11.4 Data sources

What does 15T tokens look like? Approximately:

  • All of Common Crawl (the open web crawl, ~250B web pages): a few trillion tokens of “raw” text after extraction.
  • All of Wikipedia in every language: ~30B tokens.
  • All of GitHub’s public code: ~1T tokens.
  • All of arXiv: ~50B tokens.
  • A few large book corpora (Books3, the Pile’s books subset, library scraps): ~50B tokens.
  • Various scientific paper repositories (PubMed, etc.): ~100B tokens.
  • Refined / filtered subsets of all of the above (next section): much smaller.

The mix matters. Llama 3's mix was reportedly ~95% English-dominated text and ~5% code (a share now considered low; code helps with reasoning even on non-code tasks), with various filtering and reweighting. Qwen 2.5 had a much more multilingual mix and significantly more code. DeepSeek-V3 went very heavy on math and code.

The frontier labs have access to data sources that are not in the public set: licensed book corpora, private web archives, transcribed YouTube audio, scraped Twitter/X and Reddit (despite the cease-and-desist letters), proprietary chat logs from their own products. These are part of the moat, and they are unevenly distributed across labs.

A useful heuristic for thinking about data: at roughly 100k tokens per book, 15T tokens is about 150 million books' worth of text, or roughly 200 training tokens per parameter for a 70B model. It's a lot of text. There is essentially no remaining "more high-quality web text" to add — the frontier is now: better-curated existing text, more code, more math, and synthetic data (Chapter 19).

11.5 Deduplication — the most important data step nobody talks about

Web data is enormously redundant. The same web page is often crawled hundreds of times. Wikipedia content is scraped, mirrored, and re-published across thousands of sites. Code is forked, and forks of forks are forked. News articles are syndicated. Stack Overflow questions are mirrored on a dozen aggregators. A naive Common Crawl dump has duplication factors of 10× or more.

If you train on duplicated data, you teach the model to memorize. Memorization on duplicates is bad for a few reasons:

  • It wastes capacity. Each time the model sees the same passage, it commits more of its parameter budget to remembering it word for word, instead of learning generalizable patterns.
  • It overfits. The duplicated content acts like a high-weight training signal, distorting the model’s predictions toward the duplicated style.
  • It surfaces in eval. Memorized passages show up at inference time as verbatim regurgitation. This is a copyright and privacy headache.

The fix is deduplication: identify near-duplicate documents and remove them. The standard tool is MinHash + locality-sensitive hashing (LSH), which can detect “this document is more than 80% similar to that other document” at scale across trillions of documents. The Lee et al. (2022) paper Deduplicating Training Data Makes Language Models Better showed that aggressive deduplication consistently improved model quality across model sizes, a result that’s been replicated everywhere since.

Deduplication is the single largest data-side improvement available. It’s also the least glamorous. Nobody writes papers about it. The model card mentions it in one sentence. But if you’re trying to understand why one open model is better than another at the same scale, the answer is often better deduplication.
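The mechanics of MinHash are simple enough to sketch in a few lines. This is a toy version: production pipelines like the Lee et al. setup use banded LSH over signatures like these to avoid all-pairs comparison:

```python
import hashlib

# Toy MinHash near-duplicate detection: the fraction of matching minima
# across salted hash functions estimates Jaccard similarity of shingle sets.

def shingles(text, n=3):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(text, num_perm=64):
    sh = shingles(text)
    # One minimum per "permutation", simulated by salting a hash with the seed.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in sh)
        for seed in range(num_perm)
    ]

def similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc = "the quick brown fox jumps over the lazy dog near the river bank today"
near_dup = "the quick brown fox jumps over the lazy dog near the river bank"
unrelated = "pretraining data pipelines spend most of their budget on deduplication"

print(similarity(minhash(doc), minhash(near_dup)))    # high (~0.9): near-duplicate
print(similarity(minhash(doc), minhash(unrelated)))   # near 0: unrelated
```

At scale, the signatures are split into bands and hashed into buckets, so candidate pairs surface without comparing every document to every other.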

[Figure: data-pipeline funnel with typical retention rates: raw crawl ~500T tokens (100%), after dedup ~100T (~20%), after quality filter ~50T (~10%) training corpus. Deduplication alone is a 5–10× reduction.]
Deduplication (not quality filtering) is the largest single reduction in the data pipeline, cutting a 10× redundant crawl down before any quality signal is even applied.

11.6 Quality filtering

After deduplication, the remaining text is still mostly garbage. Common Crawl is overwhelmingly low-quality content: SEO spam, machine-translated nonsense, content farms, forum noise, broken HTML extractions, repeated boilerplate, and so on. Training on the unfiltered crawl would produce a model that's confidently fluent in garbage.

The fix is quality filtering, of which there are several flavors:

Heuristic filters

Simple rules: remove documents with too many short lines, too many duplicated lines, too many bullet points, too many capitalized characters, suspicious symbol-to-word ratios, missing punctuation, etc. These are crude but they cheaply remove the worst offenders.

Examples (from the Pile and CCNet papers):

  • Lines must have at least N words on average.
  • The document must contain at least one period.
  • No more than X% of lines may be identical to other lines.
  • No more than Y% of characters may be uppercase.

A typical heuristic filter removes 20–40% of Common Crawl by document count.
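These rules translate almost directly into code. A toy version; the thresholds are illustrative, and the published pipelines tune them per corpus:

```python
# Toy heuristic quality filter in the spirit of the rules above.
# All thresholds are illustrative assumptions.

def passes_heuristics(doc, min_mean_words=3, max_dup_line_frac=0.3,
                      max_upper_frac=0.2):
    lines = [l for l in doc.splitlines() if l.strip()]
    if not lines or "." not in doc:
        return False                        # must contain at least one period
    mean_words = sum(len(l.split()) for l in lines) / len(lines)
    if mean_words < min_mean_words:
        return False                        # too many short lines
    dup_frac = 1 - len(set(lines)) / len(lines)
    if dup_frac > max_dup_line_frac:
        return False                        # repeated boilerplate lines
    letters = [c for c in doc if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > max_upper_frac:
        return False                        # shouting / SEO spam
    return True

print(passes_heuristics("A normal paragraph of prose that reads like a document."))  # True
print(passes_heuristics("BUY NOW\nBUY NOW\nBUY NOW"))                                # False
```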

Perplexity filters (CCNet-style)

Train a small N-gram language model on a “high-quality” reference corpus (Wikipedia is the classic choice). Score every Common Crawl document with this language model and keep only the documents whose perplexity is reasonable — neither too high (garbage that the LM doesn’t understand) nor too low (text so close to Wikipedia that it’s probably scraped Wikipedia, which we already have).

This was the CCNet approach (Wenzek et al., 2019) and it became the de facto standard for the open community.
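A toy version of the idea, with a unigram model standing in for CCNet's 5-gram KenLM model (the reference corpus and documents here are obviously miniature placeholders):

```python
import math
from collections import Counter

# Toy CCNet-style perplexity filter: train a tiny LM on "high-quality"
# reference text, then score candidate documents with it.

def train_unigram(reference_text):
    counts = Counter(reference_text.lower().split())
    total, vocab = sum(counts.values()), len(counts) + 1   # +1 for unknown words
    probs = {w: (c + 1) / (total + vocab) for w, c in counts.items()}
    return probs, 1 / (total + vocab)                      # add-one smoothing

def perplexity(doc, probs, unk_prob):
    words = doc.lower().split()
    logp = sum(math.log(probs.get(w, unk_prob)) for w in words)
    return math.exp(-logp / len(words))

reference = "the cat sat on the mat . the dog sat on the rug ."
probs, unk = train_unigram(reference)

print(perplexity("the cat sat on the mat .", probs, unk))   # low: resembles reference
print(perplexity("zxqv gleeb snorf vlimp", probs, unk))     # high: garbage
```

A real pipeline keeps documents in a middle perplexity band, discarding both tails as described above.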

Classifier filters (FastText, GPT-as-judge)

Train a binary classifier to distinguish “good” text from “bad” text. The “good” examples come from a high-quality source (Wikipedia, books, top-rated forum posts); the “bad” examples come from random web crawl. Apply the classifier to all your training data and keep only the documents the classifier rates highly.

The classifier can be a tiny FastText model (cheap) or a much bigger model. The Llama 3 paper described a more elaborate version: Llama 2 was used to judge document quality, and those judgments trained the production classifiers, at a cost in compute but a payoff in selectivity. This is the modern frontier-lab approach.

“Phi-style” textbook filtering

Microsoft’s Phi series (Gunasekar et al., 2023) pushed the quality-filtering idea to its limit: instead of filtering web data to remove the worst, curate a tiny corpus of “textbook-quality” text (reference materials, tutorials, math explanations) and train almost exclusively on that. The result was a 1.3B model that punched well above its weight on reasoning benchmarks. The philosophy: a small amount of much higher-quality data, trained at a high token-per-parameter ratio, beats lots of mediocre data.

The Phi line continues: Phi-2, Phi-3, Phi-4 all chase this idea. The frontier consensus is that “high quality scaled” matters more than “lots of data,” and the labs spend much of their data-side budget on quality classifiers and curation.

The cost of filtering

Quality filtering takes time. You’re running a classifier (or a small LM, or a big LM) over trillions of documents. This is its own multi-day, multi-thousand-CPU job. It also sits on the critical path: you have to score everything before you can train on the filtered subset.

The combined picture: a frontier pretraining run spends roughly as much engineering effort on data preparation as on model training. Maybe more. The data preparation pipeline is not glamorous and rarely makes the model card, but it’s where the moat is.

11.7 Tokenization for pretraining

We covered tokenizers in Chapter 5. The pretraining-specific story is short:

  • The tokenizer is trained on the same corpus (or a representative sample) before pretraining begins. Modern models train BPE or SentencePiece tokenizers on tens of GB of text, with vocab sizes between 50k and 200k.
  • The tokenizer is fixed for the entire training run. You cannot retrain it midway without throwing away the pretraining.
  • The vocab size determines the embedding and LM head size. For a 70B Llama 3 with a 128k vocab and d_model=8192, the embedding+head are about 2.1B parameters — non-trivial.
  • The tokenizer is part of the model artifact and ships with it. Always.

Practical note: data is tokenized once and then stored in tokenized form (either as raw token-ID arrays, or in a compact binary format like the Hugging Face datasets Arrow format, or in the .bin files Karpathy’s nanoGPT uses). The training loop reads from the tokenized files directly. Re-tokenizing during training is expensive and wasteful.
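A minimal sketch of the tokenize-once storage pattern, using a flat binary array of token IDs (uint16 covers vocabs up to 65,535, as in nanoGPT; a 128k vocab needs uint32). The token IDs below are placeholders for real tokenizer output:

```python
import os
import tempfile
import numpy as np

# Tokenize once, store as a flat binary file of token IDs, read back lazily
# with a memory map -- the nanoGPT-style .bin layout mentioned above.

def write_token_file(token_ids, path):
    arr = np.asarray(token_ids, dtype=np.uint16)   # uint16: vocab <= 65,535
    arr.tofile(path)

def read_token_file(path):
    # np.memmap lets the training loop index a huge corpus without
    # loading it into RAM; slices are read from disk on demand.
    return np.memmap(path, dtype=np.uint16, mode="r")

path = os.path.join(tempfile.mkdtemp(), "train.bin")
write_token_file([17, 42, 99, 3, 12001], path)     # placeholder token IDs
tokens = read_token_file(path)
print(tokens[:3])                                  # first three token IDs
```

The training data loader then slices fixed-length windows out of this array to form batches, which is why re-tokenizing during training never happens.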

11.8 The training loop

A pretraining run is, mechanically:

[Figure: pretraining loop: raw data → dedup + filter → tokenize & pack → forward + backward → optimizer step → checkpoint; repeat for every batch.]
The training loop itself is trivial; the complexity lives in the data pipeline to the left and the distributed coordination surrounding the forward/backward step.
for step in range(total_steps):
    batch = next(data_loader)                          # token IDs, shape (N, S)
    logits = model(batch[:, :-1])                      # next-token logits, shape (N, S-1, V)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           batch[:, 1:].reshape(-1))   # flatten positions for the loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
    if step % save_interval == 0:
        save_checkpoint(model, optimizer, step)

That’s it. The same loop you’d write for a tiny MLP, scaled up. The complexity is in everything around the loop:

  • Data loading. Streaming trillions of tokens off disk into the GPU at a rate that doesn’t bottleneck the GPU. This requires careful buffering, pre-fetching, and parallelism.
  • Distributed coordination. ZeRO/FSDP/TP/PP all happen here (Chapter 12). The single-line optimizer.step() becomes a complex collective operation across thousands of GPUs.
  • Checkpoint management. A 70B model in fp32 optimizer state is ~600 GB of checkpoint, written every few hours, often to a parallel filesystem. Checkpoint failures are common; restart logic must be bulletproof.
  • Health monitoring. Watch for loss spikes, NaN losses, all-reduce failures, GPU hardware failures, network failures, kernel timeouts.
  • Mid-run adjustments. Sometimes a run shows a loss spike. Sometimes the schedule needs tweaking. Sometimes a hardware failure forces a multi-hour restart. Frontier runs have on-call rotations.

A multi-week pretraining run is more like running an oil rig than running a Python script. The Llama 3.1 paper spent significant text on the operational engineering — they reported better than 90% effective training time, with frequent GPU failures, network failures, and software bugs that required mid-run intervention.

11.9 Curriculum learning

Curriculum learning is the idea of training the model on “easy” data first and gradually introducing “harder” data — analogous to how humans learn. It has been studied extensively. The result, after a decade of research, is roughly: it sometimes helps a little, but usually not enough to bother with for general-purpose pretraining.

Variants that do matter:

  • Length curriculum. Start training with shorter sequences (say, 2k tokens), then gradually increase to the target length (32k or 128k). This is a straightforward speed optimization — short sequences are faster to train on, and the model has to learn local patterns before global ones anyway. Most large pretraining runs do this.
  • Domain curriculum. Train mostly on web text early, then introduce more code, math, and high-quality data toward the end of the run. This is a form of data reweighting more than a strict curriculum, but the principle is the same. Llama 3 reportedly did this; DeepSeek-V3 explicitly schedules its data mix across the run.
  • Annealing. At the very end of pretraining, switch to a smaller, very-high-quality data subset and train at a low learning rate for a short period. This “annealing” phase consistently improves benchmark scores and is now standard.

The pure “easy then hard” curriculum (e.g., short documents first, long ones later) has not proven decisive at scale. The reweighting and annealing variants are.
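The length-curriculum variant reduces to a step-to-sequence-length schedule. A minimal sketch; the stage boundaries below are illustrative, not any lab's published schedule:

```python
# Length curriculum: sequence length steps up over the run.
# Stage boundaries (fraction of run, sequence length) are illustrative.

def seq_len_at(step, total_steps,
               stages=((0.0, 2048), (0.5, 8192), (0.9, 32768))):
    frac = step / total_steps
    length = stages[0][1]
    for start_frac, stage_len in stages:
        if frac >= start_frac:
            length = stage_len          # latest stage reached wins
    return length

print(seq_len_at(0, 100_000))           # 2048
print(seq_len_at(60_000, 100_000))      # 8192
print(seq_len_at(95_000, 100_000))      # 32768
```

The data loader consults this schedule when packing batches; early batches are short and cheap, and the long-context stages arrive only near the end of the run.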

11.10 Monitoring a pretraining run

What does a pretraining team watch in the dashboard?

  • Training loss (per token). Should be smoothly decreasing. Spikes suggest a bad batch or a numerical instability.
  • Validation loss on held-out data. Should track training loss closely. Divergence suggests overfitting (rare at pretraining scale) or a data leak.
  • Gradient norm (before clipping). Should be stable. Spikes indicate instability.
  • Per-layer activation statistics. Monitored to catch the rare “this layer’s outputs are exploding” failure mode. Modern training scripts log RMS / max of activations per layer.
  • MFU (model FLOP utilization). The ratio of actual achieved FLOPs to peak theoretical FLOPs. Healthy MFU is 30–50% on H100. Lower means you’re leaving throughput on the table.
  • Hardware health. GPU temperatures, ECC error counts, network throughput, NCCL hang detection. The bigger the cluster, the more likely something fails per day.
  • Eval scores on standard benchmarks, run on intermediate checkpoints. These tell you whether the loss decrease is translating to capability improvements.
  • Loss spikes. When (not if) the loss jumps unexpectedly, the team has to decide whether to roll back and retry, or accept and continue. Llama 3 had a memorable mid-run loss spike that they handled by reverting and skipping a problematic data shard.
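Several of these checks reduce to simple statistics over the metric stream. A minimal loss-spike detector of the rolling-z-score kind; the window and threshold are illustrative:

```python
import statistics

# Flag a step when the loss jumps more than k standard deviations above
# the rolling window of recent losses. Thresholds are illustrative.

def spike_steps(losses, window=20, k=4.0):
    spikes = []
    for i in range(window, len(losses)):
        recent = losses[i - window:i]
        mu = statistics.mean(recent)
        sigma = statistics.stdev(recent)
        if sigma > 0 and losses[i] > mu + k * sigma:
            spikes.append(i)
    return spikes

# Smoothly decreasing loss with one injected spike at step 60.
losses = [3.0 - 0.01 * i for i in range(100)]
losses[60] = 5.0
print(spike_steps(losses))   # [60]
```

In production the same idea runs over gradient norms and per-layer activation statistics, and the alert is what triggers the roll-back-or-continue decision described above.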

The dashboard for a frontier run is a wall of TensorBoard tabs and Grafana panels. Engineers watch them in shifts.

11.11 The cost: dollars, GPUs, weeks, people

A frontier pretraining run in 2025, end to end:

| Resource | Frontier dense 70B | Frontier MoE 671B |
| --- | --- | --- |
| GPUs | 2,000–16,000 H100-equivalent | 16,000–32,000 H100-equivalent |
| Wall-clock | 1–3 months | 2–6 months |
| Compute cost | $20M–$100M | $50M–$300M |
| Engineering team | 20–60 people across data, training, eval | similar |
| Total cost (compute + people + iteration) | ~$100M | ~$300M+ |

These are the numbers behind the press releases. The DeepSeek-V3 number ($6M) is striking because they hit it through extreme algorithmic optimization (fp8 training, careful pipeline scheduling, efficient MoE routing) and access to cheap H800 capacity. Most labs are not in that position.

Note that this is just the cost of the final successful run. Most labs run multiple smaller-scale runs to ablate data, architecture, and hyperparameters before committing to the frontier run. Total iteration costs are typically 2–5× the final-run cost.

11.12 Why most companies don’t (and shouldn’t) pretrain

The economics are brutal: $100M of compute, six months of wall-clock, a multi-team engineering org, and a real risk that the resulting model is worse than what’s already openly available. Almost no company should be pretraining from scratch.

The exceptions are narrow:

  • Frontier labs (OpenAI, Anthropic, Google DeepMind, Meta AI, xAI, Mistral, DeepSeek, Qwen team) where the model itself is the product.
  • Domain-specialized models where the training data is so different from the open web that fine-tuning isn’t enough — for example, biomedical models trained heavily on PubMed and clinical notes, or financial models trained on transactional data. Even these are often better served by continued pretraining (start from an open base, train further on domain data) rather than from-scratch pretraining.
  • Sovereignty / language-specific models where a country or language community wants a model that doesn’t depend on a foreign frontier lab.

For everyone else — and that’s almost everyone — the right move is to start from an open base model and fine-tune (Chapter 15) or prompt and RAG (Chapters 55–63). The cost differential is enormous: you can fine-tune a 70B Llama on a few thousand dollars of compute and get a model that’s competitive with the open-source frontier on your specific task. Pretraining gives you nothing extra in that case.

The skill to develop is knowing when not to pretrain. The default answer is “don’t.” If you can’t articulate why your specific case is in the narrow exceptions, the answer remains “don’t.”

11.13 The mental model

Eight points to take into Chapter 12:

  1. Pretraining is expensive but the cost is dropping through better algorithms and hardware utilization, not raw scale.
  2. Compute = 6 × parameters × tokens. Memorize the formula.
  3. Chinchilla said tokens and parameters should grow together. Modern practice trains smaller models on far more tokens because inference cost dominates lifetime cost.
  4. Data sources matter. The mix of web/code/math/books/papers/multilingual is the moat.
  5. Deduplication is the single most important data step. Quality filtering is the second.
  6. The training loop is simple; everything around it is hard. Distributed coordination, checkpoints, hardware failures, monitoring.
  7. Curriculum learning is mostly weak, but length curriculum and end-of-run annealing are real and standard.
  8. Almost no company should pretrain. Start from an open base and fine-tune unless your situation is in the narrow exceptions.

In Chapter 12 we look at the part that makes this all even possible: distributed training at the scale of thousands of GPUs.


Read it yourself

  • The Chinchilla paper: Hoffmann et al., Training Compute-Optimal Large Language Models (2022). The headline result, derived experimentally.
  • The Llama 3 paper: Grattafiori et al., The Llama 3 Herd of Models (2024). Read sections on data and on training. The most detailed open frontier-pretraining writeup currently available.
  • The DeepSeek-V3 technical report (2024). Read for the fp8 training details and the cost-optimization story.
  • The CCNet paper: Wenzek et al., CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data (2019).
  • Lee et al., Deduplicating Training Data Makes Language Models Better (2022). The deduplication paper everyone cites.
  • Gunasekar et al., Textbooks Are All You Need (2023) — the Phi paper.

Practice

  1. Compute the training FLOPs for a 13B model trained on 2T tokens. How long would it take on a 256-H100 cluster at 40% MFU?
  2. If you have $1M to spend on compute, what is the largest dense model you can train compute-optimally per Chinchilla? What is the corresponding token count? (Hint: convert dollars to FLOPs at an assumed $/GPU-hour and MFU, then combine C = 6ND with D = 20N.)
  3. Why is Llama 3 considered “10× over Chinchilla”? Compute the ratio.
  4. A naive Common Crawl dump has duplication factor ~10×. After deduplication, you have ~1T usable tokens of English text. Is that enough to compute-optimally pretrain a 50B model? If not, what’s missing?
  5. Why does the inference cost dominate the training cost over a model’s lifetime? Make an order-of-magnitude argument with realistic per-token costs and query volumes.
  6. Pick three open LLMs released in the last year. Read their technical reports. Compare their data-source mixes. What do the differences tell you about each lab’s priorities?
  7. Stretch: Implement a tiny pretraining loop in PyTorch on the FineWeb dataset. Train a 100M-parameter transformer for 10 hours on a single H100 and measure tokens per second, MFU, and validation loss. Compare to the published numbers from nanoGPT.

Concept check

  1. The Chinchilla scaling law (Hoffmann et al., 2022) found that prior large models were undertrained. Its key prescription is
  2. Data deduplication is called the most important data preprocessing step. The main reason is
  3. A frontier model is trained on 15 trillion tokens with a vocabulary size of 100k. Approximately how many bytes does storing the raw token ID sequence require?
  4. Why do most companies choose not to pretrain their own frontier LLM even when they have the ML talent?