Synthetic data and self-improvement loops
"Run out of high-quality human text? Use a model to write more"
We hit a quiet but enormous transition in the LLM field around 2023–2024. For most of the deep learning era, “training data” meant data that humans wrote: scraped from the web, scraped from books, sometimes paid for. The amount of high-quality human-written text on the planet is bounded, and the frontier labs ran out. The next move was inevitable: use models to generate the data they need to train their next models.
This is synthetic data, and it now dominates the post-training pipeline of every frontier model. SFT data is mostly synthetic. Preference data is mostly synthetic. Distillation data is by definition synthetic. Even some pretraining data is now synthetic. This chapter is about how it works, why it works, what its risks are, and where the field is going.
Outline:
- The data cliff and why synthetic data became inevitable.
- Self-instruct and the Alpaca-style pipelines.
- Magpie — synthetic data without seed prompts.
- Rejection sampling — generate many, keep the best.
- Constitutional AI as synthetic data.
- Distillation as synthetic data.
- The synthetic-data-for-pretraining frontier.
- Model collapse and the diversity problem.
- The legal risk surface.
- Where the field is going.
19.1 The data cliff
Pretraining on web text was the engine of GPT-3 → GPT-4. Web text is finite. The Common Crawl corpus contains roughly 300–500 trillion tokens of raw text, of which only ~15% survives strict deduplication and quality filtering. After the strictest filters, you have maybe 50 trillion tokens of “high-quality” web text, and even those filters are controversial because they discard valuable content.
Frontier models in 2024 were trained on 10–15 trillion tokens, meaning a single training run already consumes a fifth to a third of the available high-quality web text. The next generation of models — Llama 4, GPT-5, Gemini 2 — wants to train on more tokens than there are high-quality human tokens to use.
This is the data cliff, and it’s the proximate cause of the synthetic-data shift. The remaining options are:
- Lower the quality bar. Train on noisier text. The gain from more data has to outweigh the loss from lower per-token quality. Limited upside.
- Translate from other languages. Add multilingual data to the English-heavy training mix. Real but bounded.
- Use other modalities. Train on transcribed audio (YouTube, podcasts), OCR’d PDFs, image captions. Real and growing, but still bounded.
- Generate synthetic data. Use existing models to produce new training data. Unbounded — at least in principle.
The frontier labs are doing all four. Synthetic data is the one that’s growing fastest and the one with the most upside.
19.2 Self-instruct and Alpaca-style pipelines
The first influential synthetic-data paper was Self-Instruct (Wang et al., 2022), which described a pipeline for generating instruction-tuning data from a model. The pipeline:
- Seed. Start with ~175 human-written instructions covering a variety of tasks.
- Generate. Prompt a strong LLM (originally GPT-3) to generate new instructions similar in style to the seeds, but different in content. Sample many.
- Filter. Remove duplicates, near-duplicates, and instructions that look bad (too short, too long, not actually instructions).
- Generate responses. For each new instruction, prompt the LLM to write a response.
- Filter again. Remove (instruction, response) pairs where the response looks bad.
- Use as training data. SFT a target model on the filtered pairs.
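The six steps above can be sketched in a few lines. This is a minimal, runnable skeleton, not the paper's implementation: `call_llm` is a hypothetical stand-in for a real teacher API call (stubbed here so the control flow runs), and the near-duplicate filter uses simple token overlap in place of the ROUGE-based similarity filter the paper uses.

```python
import random

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would query the teacher model here.
    return "Explain photosynthesis in one paragraph."

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between token sets, a crude near-duplicate check."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def self_instruct(seeds, n_rounds=10, max_overlap=0.7):
    pool = list(seeds)                   # 1. seed instructions
    pairs = []
    for _ in range(n_rounds):
        # 2. Generate: show the teacher a few examples, ask for a new one.
        examples = "\n".join(random.sample(pool, min(3, len(pool))))
        new = call_llm(f"Tasks:\n{examples}\nWrite one new, different instruction:")
        # 3. Filter: drop too-short and near-duplicate instructions.
        if len(new.split()) < 3:
            continue
        if any(token_overlap(new, p) > max_overlap for p in pool):
            continue
        pool.append(new)
        # 4. Generate a response for the surviving instruction.
        response = call_llm(new)
        # 5. Filter again (trivially here), then keep the pair for SFT.
        if response:
            pairs.append({"instruction": new, "response": response})
    return pairs
```

With a real teacher behind `call_llm`, the pool grows each round and the filters do real work; the shape of the loop is the point.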
Self-instruct produced ~52k instructions from the ~175 seeds. Stanford’s Alpaca project applied the same pipeline with text-davinci-003 as the generator, fine-tuned LLaMA 1 7B on the result, and produced a model that performed at near-ChatGPT quality on common chat tasks. The cost was ~$600 of API calls. This was 2023’s “wait, that’s it?” moment.
The Self-Instruct / Alpaca paradigm has limitations:
- The seeds matter. The 175 seed instructions shape what kinds of new instructions get generated. A narrow seed set produces narrow synthetic data.
- The teacher’s biases propagate. Whatever the teacher is bad at, the student inherits. Whatever the teacher hallucinates, the student learns to hallucinate.
- The student inherits the teacher’s style. Alpaca-trained models sound like text-davinci-003 because they were trained on text-davinci-003 outputs. This is fine if you like that style, awkward if you don’t.
The Self-Instruct paradigm was the foundation. Subsequent work has refined it in many directions.
19.3 Magpie — synthetic data without seeds
The most elegant recent advance in synthetic data is Magpie (Xu et al., 2024), which we touched on in Chapter 16. The trick: skip the seed prompts entirely. Instead, prompt an aligned model with just the start-of-turn token (or the equivalent for that model’s chat template) and let it generate the user’s message.
# Example for Llama 3 Instruct
prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
# Now sample from the model. It will generate what it thinks
# the user might say, because that's what comes next in its
# chat template.
generated_user_message = model.generate(prompt, ...)
The aligned model has been heavily trained to produce assistant responses given user messages. When you give it nothing — just the role token — it falls back on what it thinks user messages typically look like. The result is a wide distribution of plausible user queries, with no human seed input.
You then take the generated user message and have the model respond to it normally. You have an (instruction, response) pair, generated entirely by the model with zero human input.
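Putting the two steps together, here is a minimal sketch of the Magpie loop for Llama 3's chat template. The special-token strings are the real Llama 3 Instruct template pieces; `generate` is a hypothetical stand-in for sampling from the model, stubbed here so the example runs.

```python
# Llama 3 Instruct chat-template pieces.
BOS = "<|begin_of_text|>"
USER = "<|start_header_id|>user<|end_header_id|>\n\n"
ASSISTANT = "<|start_header_id|>assistant<|end_header_id|>\n\n"
EOT = "<|eot_id|>"

def generate(prompt: str, stop: str) -> str:
    # Stub: a real implementation samples from the model until `stop`.
    if prompt.endswith(ASSISTANT):
        return "Paris."                          # answering an assistant turn
    return "What is the capital of France?"      # filling an empty user turn

def magpie_pair() -> dict:
    # Step 1: open a user turn with no content; the model writes a
    # plausible user query, because that's what follows this prefix.
    prefix = BOS + USER
    instruction = generate(prefix, stop=EOT)
    # Step 2: close the user turn, open an assistant turn, and let the
    # model answer its own query.
    full = prefix + instruction + EOT + ASSISTANT
    response = generate(full, stop=EOT)
    return {"instruction": instruction, "response": response}
```

Both the instruction and the response come from the same model; the only human input is the template itself.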
Magpie’s results are striking: training Llama 3 8B on Magpie-generated data produced a model that competed with the original Llama 3 8B Instruct on chat benchmarks, despite using completely different SFT data.
The Magpie paradigm is now standard for cheap SFT data generation. Several Magpie variants exist:
- Magpie-Pro: filtered version of the original Magpie.
- Magpie-Air: lighter, smaller dataset.
- Magpie-Reasoning: variant tuned for reasoning tasks.
The frontier labs use similar self-prompting techniques with their own internal models. The technique is so cheap that it’s now the default path for SFT data generation.
19.4 Rejection sampling
A different synthetic-data flavor: rejection sampling. The idea: have the teacher generate multiple candidate responses for each prompt, then keep only the best ones (according to some quality signal), and use those as training data.
The “quality signal” can be:
- The teacher’s own self-rating. Have the teacher evaluate its own responses.
- A reward model. Use a reward model trained on human preferences to score the candidates.
- A verifier model. For tasks with checkable outputs (code that compiles, math that’s correct), a verifier program decides which candidates pass.
- Vote-then-keep. Generate K candidates, take the most common answer (good for math and reasoning).
Rejection sampling is the technique behind many recent reasoning-model improvements. The idea: if your model can sometimes solve a hard problem and sometimes fail, you generate many attempts, keep the successful ones, and train the model on those. This is “self-improvement” in the most literal sense — the model gets better by learning from its own successes.
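Two of the quality signals above fit in a few lines each. A minimal sketch, assuming the candidate attempts have already been sampled from the model; the verifier here is just a Python predicate, and the function names are illustrative, not a real library's API.

```python
from collections import Counter

def rejection_sample(prompt, attempts, verifier):
    """Keep only the attempts the verifier accepts; the survivors
    become (prompt, response) training pairs."""
    return [{"prompt": prompt, "response": a} for a in attempts if verifier(a)]

def majority_vote(final_answers):
    """Vote-then-keep: treat the most common final answer as correct."""
    return Counter(final_answers).most_common(1)[0][0]
```

In practice the two are often combined: vote across K attempts to decide the answer, then keep only the attempts that reach it.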
The risks:
- Mode collapse. If you only train on the model’s most-confident (and therefore narrowest) outputs, the model collapses to a smaller distribution.
- Confirmation bias. Rejecting samples that “look wrong” can reject correct-but-unconventional answers, narrowing the space of accepted reasoning.
- Verifier exploitation. The model learns to game the verifier, not to actually solve the task.
Rejection sampling works best when there’s an objective verifier for the task (math, code, structured generation). For open-ended chat tasks, the quality signal is too noisy and the rejection process introduces biases.
The recent reasoning models — o1, R1, and the open-source replications — rely heavily on rejection sampling on math/code datasets to bootstrap their reasoning abilities.
19.5 Constitutional AI as synthetic data
We covered Constitutional AI in Chapter 17 as an alignment technique. It’s also a synthetic-data technique.
Recall the CAI loop:
- The model generates a response.
- The model critiques its own response against a written constitution.
- The model rewrites the response to address the critique.
- The (original, revised) pair becomes a preference pair: original is “rejected,” revised is “chosen.”
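The loop translates almost directly into code. A minimal sketch, with `call_llm` as a hypothetical stand-in for the model and a one-line constitution; the prompt wording is illustrative, not Anthropic's actual prompts.

```python
CONSTITUTION = "The assistant should be helpful, honest, and harmless."

def cai_preference_pair(prompt: str, call_llm) -> dict:
    # 1. Generate an initial response.
    original = call_llm(prompt)
    # 2. Critique it against the constitution.
    critique = call_llm(
        f"Constitution: {CONSTITUTION}\n"
        f"Response: {original}\n"
        "Identify any way this response violates the constitution.")
    # 3. Rewrite the response to address the critique.
    revised = call_llm(
        f"Response: {original}\n"
        f"Critique: {critique}\n"
        "Rewrite the response to address the critique.")
    # 4. The (original, revised) pair becomes a preference pair.
    return {"prompt": prompt, "chosen": revised, "rejected": original}
```

The output dict is exactly the shape a DPO trainer consumes, which is why CAI doubles as a preference-data factory.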
The output is a fully synthetic preference dataset with no human labelers in the preference loop. Anthropic uses (variants of) this to scale safety alignment — labeling safety data is expensive and emotionally taxing for humans, so doing it with a model is both cheaper and (arguably) kinder.
CAI works because the critique step has more context than the original generation. The model writing the critique knows about the constitution, has had time to “think,” and is being asked an easier task (“does this follow the rules?”) than the original task (“write a good response to this prompt”). Even if the same model produces both, the critique can be better than the original because of the additional structure.
This is a generalizable pattern: easier verification than generation. We see it in math (checking a proof is easier than writing it), in code (running a test is easier than writing the function), in chat (judging two responses is easier than writing one). Synthetic data techniques exploit this asymmetry.
19.6 Distillation as synthetic data
Knowledge distillation (Chapter 18) is, formally, a synthetic data pipeline. The teacher generates outputs; the student is trained on them. The “data” the student trains on is entirely synthetic.
Most of what people call “distillation” is hard-target distillation: take the teacher’s argmax response, treat it as ground truth. This is the same shape as Self-Instruct, just with the framing of “teacher transferring knowledge to student” rather than “generating training data.”
The lines between distillation, synthetic data, and rejection sampling have blurred. A typical modern post-training pipeline does all three in different stages:
- Generate diverse instructions with Magpie-style self-prompting.
- For each instruction, sample K responses from a strong teacher.
- Keep the best response per instruction (rejection sampling, with the verifier or another judge model).
- SFT the student on the (instruction, kept response) pairs.
- Generate preference pairs by sampling more responses and ranking them.
- DPO the student on the preference pairs.
The whole pipeline is synthetic. The only humans involved are the people designing the prompts and the verifiers, not the people writing the data.
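The six stages above reduce to a short driver loop. This is a sketch with every model call abstracted behind hypothetical callables (`gen_instruction`, `gen_response`, `score`), so only the shape of the pipeline is shown:

```python
def synthetic_pipeline(n, gen_instruction, gen_response, score, k=4):
    """n prompts in; an SFT set and a DPO preference set out."""
    sft_data, pref_data = [], []
    for _ in range(n):
        instr = gen_instruction()                             # 1. Magpie-style
        candidates = [gen_response(instr) for _ in range(k)]  # 2. sample K
        ranked = sorted(candidates, key=score, reverse=True)  # 3. rank by judge
        sft_data.append((instr, ranked[0]))                   # 4. keep the best
        pref_data.append({"prompt": instr,                    # 5. pref pairs
                          "chosen": ranked[0],
                          "rejected": ranked[-1]})
    return sft_data, pref_data                                # 6. SFT, then DPO
```

Steps 1–4 produce the SFT set; step 5's pairs feed DPO in step 6.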
19.7 Synthetic data for pretraining
The most contentious synthetic data trend: using model-generated text for pretraining, not just post-training.
The Phi series (Phi-1, Phi-2, Phi-3, Phi-4) is the most influential example. Microsoft’s claim: instead of pretraining on the open web, generate “textbook-quality” educational content with a frontier model, and pretrain a small model exclusively on that. The result is a small model with surprising capabilities, especially on reasoning and code.
The Phi-3 paper reported pretraining on 3.3 trillion tokens, of which a substantial fraction (the exact percentage is not disclosed) was synthetic. The model is much stronger per-parameter than non-synthetic-trained models of the same size.
Two things make this work:
- Quality > quantity. Synthetic data can be uniformly high quality in a way that web text can’t. You can filter out the bad samples and keep only the good ones.
- Curriculum control. You can deliberately generate data on the topics where the model is weakest, in the proportions you want. Web text gives you what humans happen to have written; synthetic gives you what you want to learn.
The skeptical case: synthetic-pretrained models are good at the things their teacher was good at and bad at the things their teacher was bad at. The Phi models, for example, do unusually well on textbook-style benchmarks (where the synthetic data has the same shape as the eval) and less well on open-ended conversational tasks. They’re optimized for the synthetic distribution, not for the natural distribution.
The frontier consensus, as of late 2025: synthetic data is real and increasingly important, but not yet a complete replacement for web pretraining. The mix is shifting from 100/0 (web/synthetic) toward something like 50/50 over time. We’ll see where it lands.
19.8 Model collapse and the diversity problem
The deep risk of a synthetic-data future: model collapse. If you train model N on the outputs of model N-1, and model N+1 on the outputs of model N, and so on, the distribution narrows over generations.
This was demonstrated formally by Shumailov et al. (2024) in The Curse of Recursion: Training on Generated Data Makes Models Forget. The paper showed that recursive training on synthetic data leads to progressively worse models, particularly in the tails of the distribution.
The mechanism: every generative model makes errors, and the rare events (the long tail) are the easiest to under-sample. When you train the next model on the previous one’s outputs, the rare events are even more under-represented. By generation 10, the long tail is gone.
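The mechanism is easy to demonstrate on a toy distribution. The sketch below stands in for "model" with a fitted Gaussian: each generation fits to samples drawn from the previous fit, and because the maximum-likelihood variance estimate is biased low, the tails tend to shrink generation after generation. This is a stdlib-only illustration of the dynamic, not a reproduction of the paper's experiments.

```python
import random
import statistics

def collapse_demo(generations: int = 20, n: int = 50, seed: int = 0) -> list:
    """Fit a Gaussian, sample from the fit, refit, repeat.
    Returns the fitted sigma after each generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0       # generation 0: the "real" distribution
    sigmas = [sigma]
    for _ in range(generations):
        # "Train" the next model on the previous model's outputs.
        data = [rng.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)   # ML estimate: biased low
        sigmas.append(sigma)
    return sigmas
```

Run it over a few hundred generations and sigma tends to drift toward zero: rare tail events get under-sampled, then never regenerated. Mixing a fraction of draws from the original distribution into each generation's fit is the toy version of the mix-in-real-data mitigation.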
Mitigations:
- Mix in real data. Always train on a substantial fraction of human-written data, even when generating synthetic. The real data anchors the distribution.
- Diverse teachers. Use multiple different teachers, not just one. The biases of any single teacher get diluted.
- Active diversity injection. Deliberately generate data covering underrepresented topics, styles, or domains.
- Reject too-confident generations. Filter out outputs that are too “narrow” — they’re a sign of mode-collapse-in-progress.
In practice, the labs avoid model collapse by always training on a mix of real and synthetic data, and by using multiple teacher models and explicit diversity checks. The danger is real but it can be managed.
The longer-term concern is the open web becoming polluted with model-generated text. As more of the web becomes synthetic (LLM-generated articles, blog posts, code), future scrapes for “human-written” data are increasingly contaminated with model outputs. Researchers are starting to worry about this seriously.
19.9 The legal risk surface
Synthetic data is a legal minefield. The major risks:
(1) Teacher license violations. OpenAI, Anthropic, Google, and Meta all have terms of service forbidding using their API outputs to train competing models. Stanford’s Alpaca was technically a violation; Vicuna was a violation; many of the early open-source instruction-tuned models were violations. Legal action has been threatened but rarely pursued. The legality is murky.
(2) Copyright claims through synthetic data. If a teacher model was trained on copyrighted data and then “outputs” near-verbatim copies of that data, the synthetic data might inherit copyright issues from the underlying training. Several lawsuits in 2024–25 explore this question; nothing is settled.
(3) Privacy claims. If the teacher memorizes personal data from its training set and emits it in synthetic generations, the synthetic data contains leaked PII. This is a real risk in any pipeline that uses a frontier model as teacher.
(4) Attribution problems. Synthetic data has no clear authorship. If the resulting model produces text that’s “in the style of” a copyrighted author, who’s liable?
The frontier labs handle this by using only their own models as teachers: OpenAI can use GPT-4 to generate data for GPT-5, and Anthropic can use Claude to generate data for Claude. The legal status is clean because they own both ends.
For open-source teams, the cleanest approach is to use open-source teachers under permissive licenses (Llama, Qwen, Mistral) or synthetic data from your own previous models. Avoid using outputs from API-only models (OpenAI, Anthropic, Google) unless you’re confident about the legal exposure.
19.10 Where the field is going
The trends to watch:
(1) Synthetic data for reasoning. Math and code reasoning is the area where synthetic data is most clearly winning. You can verify correctness, you can reject failures, you can generate millions of correct reasoning traces. The o1-style and R1-style reasoning models all use synthetic reasoning data heavily.
(2) Multi-teacher and ensemble synthetic data. Generating with multiple different teachers reduces bias and increases diversity. The trend is toward “use 5 different teachers and average / reject” rather than “use one teacher only.”
(3) Self-improvement loops. The most provocative direction: a model generates training data, trains itself on the data, becomes slightly better, generates better data, and so on. This is the loop the frontier labs are most invested in for the next generation of models. Whether it actually scales remains to be seen.
(4) Synthetic pretraining at scale. Phi started this; the question is how far it can go. Can you pretrain a 70B model entirely on synthetic data and match the quality of the same architecture trained on web data? Nobody has demonstrated this yet, but the intermediate results (50% synthetic) are promising.
(5) Verifier-guided generation. Using formal verifiers (proof assistants for math, type checkers and test runners for code, structured constraint solvers for other domains) to guide synthetic data generation. The model proposes; the verifier checks; only the verified outputs become training data.
The upshot: synthetic data is a one-way door. Every lab that adopts it gets cheaper data, faster iteration, and more control over the training distribution. None of them are going back. The question is just how far the technique can push, and how it interacts with model collapse risks.
19.11 The mental model
Eight points to take into Chapter 20:
- The data cliff is real. High-quality web text is finite and the frontier has used most of it.
- Self-instruct / Alpaca — generate instructions and responses from a strong teacher. Cheap and effective for SFT data.
- Magpie — generate instructions by prompting an aligned model with no input. Self-improvement without seeds.
- Rejection sampling — generate many candidates, keep the best. Best when there’s an objective verifier.
- Constitutional AI as synthetic data — model self-critique generates preference pairs. Scales safety alignment.
- Distillation is synthetic data — same idea, different framing. The line is blurry.
- Model collapse is the long-term risk: recursive training on synthetic data narrows the distribution. Mitigated by mixing in real data.
- Legal risk is real but most labs sidestep it by using only their own teachers.
In Chapter 20 we close out Part II with the hardest unsolved problem in ML: evaluation.
Read it yourself
- Wang et al., Self-Instruct: Aligning Language Models with Self-Generated Instructions (2022). The original Self-Instruct paper.
- Xu et al., Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing (2024).
- Shumailov et al., The Curse of Recursion: Training on Generated Data Makes Models Forget (2024). The model collapse paper.
- Gunasekar et al., Textbooks Are All You Need (2023) and follow-up Phi technical reports. The synthetic-pretraining manifesto.
- Bai et al., Constitutional AI: Harmlessness from AI Feedback (2022).
Practice
- Run Magpie on a small Llama 3 Instruct model: prompt it with just the start-of-turn token and capture 100 generated instructions. Inspect their diversity. How many distinct “topics” do you get?
- Implement a tiny rejection-sampling pipeline for a math task: generate K=10 attempts at a math problem, accept the ones that match the correct answer, reject the others. Train a small model on the accepted ones.
- Why does Constitutional AI’s “critique step” produce useful signal even though the same model is doing both the original generation and the critique? Construct a thought experiment.
- Design a synthetic-data pipeline for a customer support assistant. What teachers would you use? What would you reject? How would you avoid model collapse?
- Read the Curse of Recursion paper. The collapse is shown over many generations. Estimate how many generations of synthetic-only training you could do before significant collapse.
- What’s the legal status of training a commercial open-source model on outputs from an OpenAI API call? What about training on outputs from Llama 3 Instruct? Compare the two cases.
- Stretch: Build a multi-teacher synthetic data pipeline that generates the same instruction with three different teachers and keeps the response that all three agree on. Measure the diversity vs single-teacher generation.
Concept check
- 1. Self-Instruct generates SFT training data without human-written examples for each task. What is its core mechanism?
- 2. Rejection sampling generates many candidate responses per prompt and keeps only the best ones. What quality signal is typically used to select which responses to keep?
- 3. Model collapse refers to the degradation in diversity and quality that occurs when models are trained on data generated by earlier model generations. What is the proposed root cause?
- 4. Constitutional AI uses a model's self-critique to produce preference data without human raters. Why does this still require a high-quality base model to function correctly?