SFT and instruction tuning: turning a base model into an assistant
"A base model is a stochastic parrot. An instruct model is a stochastic parrot that has learned to play a role."
In Chapter 15 we covered the methods of fine-tuning (LoRA, QLoRA, full FT). In this chapter we cover what you fine-tune for — specifically, the most common kind of LLM fine-tuning: supervised fine-tuning (SFT) for instruction following. By the end you will know how to take a raw base model and turn it into a chatbot, what dataset format actually works, what catastrophic forgetting is and how to avoid it, and why the chat-template-equals-data invariant matters.
Outline:
- The base model vs the instruct model — the gap.
- The SFT objective: NLL on the response, masked on the prompt.
- Dataset construction — quality vs quantity.
- The chat template, again, with feeling.
- The classics: Alpaca, Dolly, Open-Hermes, ShareGPT, Magpie.
- The synthetic-data shift.
- Catastrophic forgetting.
- Multi-task and continual SFT.
- Hyperparameters — the only ones that matter.
- Common SFT bugs.
16.1 The base vs instruct gap
A pretrained base model is a next-token predictor. Give it "The capital of France is" and it will produce " Paris" followed by some plausible Wikipedia-style continuation. Give it "How do I make a sandwich?" and it will produce… a Reddit thread, or a recipe site, or a Wikipedia article on sandwiches, or whatever the base model thinks comes after that question on the open web. It does not produce a helpful answer to the question, because nothing in pretraining taught it to.
Pretraining teaches the model language modeling: given any prefix, what comes next on the web? This is enormously useful, but it’s not what users want. Users want a model that answers questions, follows instructions, holds a conversation, and stays in role. This is a different task, and it has to be taught explicitly.
The teaching is done with supervised fine-tuning (SFT) on a dataset of (instruction, response) pairs. Each pair shows the model “here’s what a user asked, and here’s what a helpful assistant should say.” The model learns to associate the instruction format with helpful responses, and to produce them on its own.
The base-to-instruct transformation is one of the most consequential single steps in the modern LLM pipeline. A 7B base model that scores around 3 out of 10 on MT-Bench can score above 6 after a good SFT round. The improvement isn’t from the model “learning more” — it’s from the model learning the task of being an assistant.
16.2 The SFT objective
The SFT loss is standard next-token cross-entropy, exactly the same as pretraining (Chapter 4). The only difference is what gets included in the loss.
For an (instruction, response) example formatted in the chat template, the input might look like:
```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

How do I sort a list in Python?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

You can use the sorted() function: sorted(my_list).<|eot_id|>
```
You feed this entire string into the model and compute next-token loss at every position. But you only count the loss for the response tokens — the tokens after assistant<|end_header_id|>\n\n and before the closing <|eot_id|>. The user’s prompt tokens are masked out of the loss.
Why? Because the prompt is always supplied at inference time — the model never has to generate it. Computing loss on those positions spends gradient on imitating user text, which is not the behavior you’re trying to teach. Masking the prompt focuses all the gradient signal on the response, which is what we want the model to learn.
In code (HuggingFace trl.SFTTrainer does this for you, but if you write a custom loop):
```python
import torch.nn.functional as F

def compute_loss(model, batch):
    # batch has 'input_ids' and 'response_mask' (1 where the response is, 0 elsewhere)
    logits = model(batch['input_ids']).logits
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = batch['input_ids'][:, 1:]
    shift_mask = batch['response_mask'][:, 1:].float()
    loss_per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction='none',
    ).reshape_as(shift_labels)
    # Zero out prompt positions; average over response tokens only.
    masked_loss = (loss_per_token * shift_mask).sum() / shift_mask.sum()
    return masked_loss
```
This is the entire SFT training step. Mask the prompt, compute cross-entropy on the response, average. There’s no special loss function, no clever trick, no auxiliary objective — just standard cross-entropy with the prompt masked out.
The “loss masking” point is the single most common bug in custom SFT scripts (mentioned in Chapter 15 as well; it’s worth repeating). Every framework that does SFT correctly handles it; if you write your own loop you have to handle it explicitly.
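The masked average itself is easy to test in isolation. A minimal numpy sketch with toy numbers (illustrative only, not tied to any real model):

```python
import numpy as np

# Per-token losses for a 6-token example: 2 prompt tokens, 4 response tokens.
loss_per_token = np.array([5.0, 4.0, 1.0, 2.0, 1.5, 0.5])
response_mask  = np.array([0,   0,   1,   1,   1,   1  ])

# Masked mean: only the response positions contribute.
masked_loss = (loss_per_token * response_mask).sum() / response_mask.sum()
assert masked_loss == (1.0 + 2.0 + 1.5 + 0.5) / 4  # prompt losses ignored

# An unmasked mean would be dragged around by the irrelevant prompt losses.
assert loss_per_token.mean() != masked_loss
```

If your custom loop’s loss changes when you perturb only the prompt positions, the mask isn’t being applied.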
16.3 Dataset construction — quality over quantity
The single biggest lever in SFT is dataset quality. Good SFT data is:
- Diverse. Covers many task types (Q&A, summarization, translation, code, math, creative writing, roleplay, refusal, multi-turn conversation).
- Helpful. The responses actually answer the questions correctly, in a useful format.
- Well-formatted. Consistent formatting across examples, matching the chat template the model will be served with.
- High-signal. No noise, no broken examples, no responses that contradict themselves or the instructions.
The empirical finding from years of fine-tuning research: more high-quality data wins, but past a certain point, more low-quality data hurts. Specifically, the LIMA paper (Zhou et al., 2023) showed that fine-tuning on just 1,000 carefully curated examples can produce a model that competes with one fine-tuned on 52,000 mediocre examples (the Alpaca set). Quality matters more than quantity.
The dataset sizes that have actually shipped:
| Dataset | Size | Notes |
|---|---|---|
| Alpaca (2023) | 52k | Self-instructed from text-davinci-003. Medium quality. |
| Dolly (2023) | 15k | Human-written by Databricks employees. High quality, narrow. |
| Open-Hermes (2023) | 1M+ | Mixed sources, varying quality. |
| LIMA (2023) | 1,000 | Highly curated. Demonstrated quality > quantity. |
| ShareGPT (2023) | ~70k | Real ChatGPT conversations, scraped/exported. |
| UltraChat (2023) | ~1.5M | LLM-generated, multi-turn. |
| OpenAssistant (2023) | ~10k | Crowdsourced, multi-turn, multi-language. |
| Magpie (2024) | varies | Self-generated by prompting Llama-3-Instruct. New paradigm. |
The trend is from scraped data (ShareGPT) to human-written data (Dolly, OpenAssistant) to distilled data (Alpaca, UltraChat, Magpie). The current frontier is heavy synthetic-data generation — Magpie, in particular, gets its instructions by prompting an instruction-tuned model with no input and capturing what it generates as the user query. We’ll come back to synthetic data in Chapter 19.
16.4 The chat template, again
We covered chat templates in Chapters 5 and 10. In SFT, they become absolutely critical because the model learns the template at the same time as the responses.
When you fine-tune a base model, you teach it: “when you see this exact pattern of special tokens (the chat template), produce a response in this format.” After training, the model expects the same template at inference. If the inference template doesn’t exactly match the training template, the model behaves badly.
A few rules:
- Decide the template before you start training. Pick one (Llama 3, ChatML, Alpaca, etc.) and stick with it.
- Verify by tokenizing. Run `tokenizer.apply_chat_template(messages, tokenize=False)` on a sample, look at the actual string, and confirm it matches what your dataset is producing. The tokens have to match byte-for-byte.
- Don’t change the template after training. If you decide later that you want a different format, you have to retrain.
- Set the template in `tokenizer_config.json` so it ships with the model. Anyone loading the model gets the right template automatically.
The chat template is part of the model artifact. Forgetting this — or training with one template and serving with another — is one of the most expensive bugs in this part of the pipeline.
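The byte-for-byte check can be mechanized. Below is a minimal sketch using a hand-rolled ChatML-style renderer as a stand-in for the canonical template — in practice you would compare your dataset strings against `tokenizer.apply_chat_template(messages, tokenize=False)`; the function names here are illustrative:

```python
def render_chatml(messages):
    """Render a list of {role, content} dicts in ChatML format."""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    return "".join(out)

def assert_template_match(dataset_text, messages):
    """Fail loudly if the dataset string drifts from the canonical render."""
    canonical = render_chatml(messages)
    if dataset_text != canonical:
        # Report the first differing byte offset to make debugging easy.
        i = next(
            (k for k, (a, b) in enumerate(zip(dataset_text, canonical)) if a != b),
            min(len(dataset_text), len(canonical)),
        )
        raise ValueError(f"template mismatch at byte {i}: "
                         f"{dataset_text[i:i+20]!r} vs {canonical[i:i+20]!r}")

messages = [
    {"role": "user", "content": "How do I sort a list in Python?"},
    {"role": "assistant", "content": "Use sorted(my_list)."},
]
good = render_chatml(messages)
assert_template_match(good, messages)  # passes: byte-for-byte identical
```

A single missing newline (e.g. `good.rstrip("\n")`) is enough to fail the check — which is the point.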
16.5 The classics
A short tour of the SFT datasets that mattered:
Alpaca (Stanford, March 2023)
The first influential “self-instruct” dataset. The Stanford team prompted text-davinci-003 to generate 52k (instruction, response) pairs, then fine-tuned LLaMA 1 7B on them. The result was a Llama variant that behaved much more like ChatGPT, at the cost of $600 of OpenAI API credits. Alpaca was the proof of concept that synthetic instruction data works. It also embarrassed Stanford’s lawyers and started the era of “is using ChatGPT outputs to train a competitor model legal?”
Dolly (Databricks, April 2023)
The opposite approach: 15k instruction/response pairs written by humans (Databricks employees). High quality, narrower domain (mostly technical/business questions), and importantly commercially licensed (Apache 2.0). Dolly showed that you could get good fine-tuning results from human-written data, and gave the open-source community a license-clean dataset.
ShareGPT (community, 2023)
A scraped collection of real ChatGPT conversations, exported by users. ~70k multi-turn conversations of varying quality. Used as the basis for Vicuna (LMSYS) and many subsequent models. Legally questionable; OpenAI explicitly forbids using their outputs to train competitors. The community used it anyway.
OpenAssistant / OASST1 (LAION, 2023)
A community-crowdsourced dataset of multi-turn, multi-language conversations. 10k+ conversations with quality ratings from human reviewers. One of the first serious open competitors to the OpenAI-derived datasets.
UltraChat (Tsinghua, 2023)
Synthetic multi-turn dialogue generated by prompting GPT-3.5 to play both the user and the assistant. 1.5M conversations. The model learns to handle longer multi-turn interactions. Used in many strong open models including Zephyr.
Open-Hermes / Hermes (Nous, 2023–2024)
A mix of multiple sources, curated and filtered. Broad coverage. Used as the base SFT data for many fine-tunes.
Magpie (Xu et al., 2024)
The most interesting recent technique. Magpie generates instructions by prompting an aligned model with no input — literally just the start-of-turn token — and capturing what the model generates as the user message. The aligned model has been trained to respond to instructions, and when given an empty input, it tends to generate plausible instructions on its own. Then you have the model respond to its own generated instruction, and you have an (instruction, response) pair — generated entirely by the model itself with no human writing or seed data.
Magpie is a watershed because it cuts out the human entirely. You can generate millions of pairs at the cost of inference. The quality is surprisingly good. Several recent fine-tunes (Magpie-Pro, Magpie-Air) use this approach exclusively. Whether it produces models that are “actually” learning new things or just regurgitating what the teacher knew is an open question; the empirical results speak for themselves.
16.6 The synthetic-data shift
The throughline of the dataset history above: the field moved from human-written data to LLM-generated data, in steps of decreasing human involvement:
- Pure scraped (Common Crawl, web forums): no curation.
- Scraped + filtered (ShareGPT with quality filtering): some human signal.
- Human-written (Dolly, OpenAssistant): direct human labor.
- Human-seeded synthetic (Alpaca, UltraChat): a few human seed instructions, then LLM-generated.
- Pure synthetic (Magpie): no human input at all.
The shift is driven by economics. Human-written data is expensive — Dolly cost Databricks weeks of employee time. Synthetic data is cheap — Magpie generates its data at inference cost, which is orders of magnitude lower. The quality gap has narrowed dramatically over the last two years.
The risks of pure synthetic data:
- Mode collapse. The teacher model has biases, blind spots, and failure modes. Training on its outputs propagates those.
- Diversity loss. The teacher’s distribution is narrower than the human distribution. Repeated synthetic generation tends to amplify the narrowness.
- Hallucination amplification. The teacher hallucinates; the student learns to hallucinate the same things confidently.
- Legal risk. Most aligned models have terms of service forbidding training competitor models on their outputs. OpenAI, Anthropic, and Google all have such clauses. Using their outputs is at minimum a terms-of-service violation.
The frontier compromise is synthetic-from-our-own-models: lab A uses lab A’s own teacher to generate data for lab A’s next student. This sidesteps the legal risk and lets the lab control quality. Most frontier post-training pipelines work this way now. We’ll cover synthetic data in more depth in Chapter 19.
16.7 Catastrophic forgetting
The classic risk of SFT: the model forgets what it learned in pretraining. If you fine-tune a Llama base on Korean medical questions, you can end up with a model that’s great at Korean medicine and bad at everything else, including basic English, math, and code. The model has overwritten its general capabilities with task-specific ones.
The mechanism: gradient descent on a narrow distribution moves the parameters away from where they were after pretraining. If the dataset is small relative to the change you make to the model (high learning rate, many epochs, large effective batch), you can move far enough to lose the original capabilities.
Mitigations:
(1) Lower learning rate. SFT uses a much lower learning rate than pretraining — typically 1e-5 to 5e-5, vs the ~1e-4 to 3e-4 of pretraining. Lower LR means smaller per-step updates and less drift.
(2) Fewer epochs. Most SFT runs use 1–3 epochs over the dataset. More than 3 epochs is usually overfitting territory; the model starts memorizing the dataset and losing generalization.
(3) LoRA instead of full fine-tuning. LoRA’s low-rank constraint inherently limits how far the model can drift from its pretrained weights. The base weights are frozen entirely — only the small adapter changes. This makes catastrophic forgetting much less likely.
(4) Multi-task data mixing. If you fine-tune on a narrow task, also include some general-purpose data in the mix to keep the model’s broad capabilities active. The Llama 3 instruct tuning reportedly mixed dozens of task types.
(5) Continued pretraining as a precursor. If you’re adapting a model to a very different domain (e.g., medical text), do a phase of continued pretraining on the domain corpus before SFT. The model adapts gradually to the new distribution rather than being shocked by the SFT step.
The empirical pattern: catastrophic forgetting is real but mostly avoidable with sensible hyperparameters and LoRA. The teams that hit it hardest are the ones that do full fine-tuning at high LR for many epochs on narrow datasets.
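Mitigation (3) is worth seeing concretely. A toy numpy sketch of a LoRA-style forward pass — shapes, names, and scales are illustrative, not a real training loop:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4            # hidden size, LoRA rank, LoRA alpha

W = rng.normal(size=(d, d))      # pretrained weight: frozen, never updated
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))             # B starts at zero, so the adapter is a no-op

def lora_forward(x):
    # Effective weight is W + (alpha/r) * B @ A. Only A and B ever receive
    # gradients, so the model can drift at most rank-r away from W.
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.normal(size=(1, d))
y_before = lora_forward(x)

W_snapshot = W.copy()
B += rng.normal(size=(d, r)) * 0.1   # pretend an optimizer step updated B

assert np.allclose(W, W_snapshot)    # the pretrained weights are untouched
```

The base weights are bit-identical after any number of adapter updates; the only thing that can change is the low-rank delta.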
16.8 Multi-task and continual SFT
A modern SFT dataset is rarely single-task. A frontier instruction tune mixes:
- General Q&A
- Coding tasks (Python, JavaScript, SQL)
- Math word problems
- Multi-turn dialogue
- Roleplay
- Summarization
- Translation
- Refusal of harmful requests
- Tool-use / function-calling examples
- Constitutional/safety-related responses
The mixing strategy matters. The simplest approach is uniform sampling from each task. The more sophisticated approach is temperature-scaled sampling (similar to multilingual tokenizer training in Chapter 14): upweight rare tasks, downweight common ones, control diversity via a temperature parameter.
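Temperature-scaled sampling is a few lines. A sketch — the function name and task counts are made up for illustration:

```python
def task_sampling_weights(task_counts, temperature=0.5):
    """p_i proportional to n_i ** temperature, normalized.
    temperature=1.0 -> proportional sampling; temperature -> 0 -> uniform."""
    scaled = {t: n ** temperature for t, n in task_counts.items()}
    total = sum(scaled.values())
    return {t: s / total for t, s in scaled.items()}

counts = {"qa": 500_000, "code": 100_000, "roleplay": 5_000}
w = task_sampling_weights(counts, temperature=0.5)

# Rare tasks get upweighted relative to their raw share of the corpus:
assert w["roleplay"] > 5_000 / 605_000
assert abs(sum(w.values()) - 1.0) < 1e-9
```

At temperature 0.5 the roleplay share rises from under 1% to roughly 6%, without ever inverting the ordering of task frequencies.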
Continual SFT is the practice of repeatedly fine-tuning the same model with new data over time, as new tasks arise. Each new fine-tune is a small additional training run on top of the previous model’s weights. The risk is compounding catastrophic forgetting: each fine-tune drifts the model a little, and after many rounds you’re far from where you started.
The mitigation is to always include some of the original SFT data in each new round, so the model is reminded of its previous capabilities. This is sometimes called “rehearsal” in the continual learning literature.
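A rehearsal mix is simple to construct. A sketch — `rehearsal_frac` here is an illustrative knob, not a published recommendation:

```python
import random

def build_round_mixture(new_data, original_data, rehearsal_frac=0.25, seed=0):
    """Mix new SFT examples with a slice of the original SFT data.
    rehearsal_frac is the share of the final mixture drawn from old data."""
    rng = random.Random(seed)
    # Solve n_old / (n_new + n_old) = rehearsal_frac for n_old.
    n_old = round(len(new_data) * rehearsal_frac / (1 - rehearsal_frac))
    n_old = min(n_old, len(original_data))
    mixture = list(new_data) + rng.sample(list(original_data), n_old)
    rng.shuffle(mixture)
    return mixture

new = [{"id": i} for i in range(300)]
old = [{"id": 1000 + i} for i in range(1000)]
mix = build_round_mixture(new, old, rehearsal_frac=0.25)
assert len(mix) == 400  # 300 new + 100 rehearsal examples
```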
For most production teams, continual SFT is overkill. The simpler pattern is “rebuild the SFT dataset from scratch when something new comes up, retrain from the base, deploy.” This gives you a clean lineage and avoids the drift issue. The cost is paying for the full training again.
16.9 Hyperparameters that matter
The set of hyperparameters that actually matter for SFT, in rough order of importance:
Learning rate. Use 1e-5 to 5e-5 for full fine-tuning, 1e-4 to 5e-4 for LoRA. LoRA can tolerate higher LRs because the base weights are frozen.
Number of epochs. 1–3. More is usually overfitting. Watch the eval loss; stop early if it stops decreasing.
Batch size. Larger is usually better for stability, but more expensive. Effective batch size of 128–512 examples is standard. Use gradient accumulation to reach this on small hardware.
Sequence length. Set to the longest example in your dataset (or truncate to a sensible max like 4k–8k). Longer sequences are more expensive both in memory and in compute.
Warmup. ~3% of total steps as linear warmup. Less critical for SFT than for pretraining (because you’re starting from already-trained weights), but still helpful.
LR schedule. Cosine decay to 10% of peak. Same as pretraining.
LoRA rank. 16 is the default. 32 for harder tasks.
LoRA alpha. Set to r or 2r.
LoRA dropout. 0.05 is the conventional default. Helps prevent overfitting.
That’s the list. Most other things either don’t matter at SFT scale or are determined by the framework. Don’t overtune; the defaults are usually good.
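The warmup-plus-cosine schedule described above fits in one function. A sketch — frameworks ship this (e.g. scheduler utilities in HuggingFace Transformers), so this is for understanding, not production:

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5,
               warmup_frac=0.03, min_lr_frac=0.10):
    """Linear warmup for ~3% of steps, then cosine decay to 10% of peak."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = peak_lr * min_lr_frac
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For a 1,000-step run the LR ramps up over the first 30 steps, peaks at exactly `peak_lr`, and ends near `0.1 * peak_lr`.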
16.10 Common SFT bugs
The bugs that come up over and over in real SFT runs:
(1) Loss masking missing. Already mentioned. The biggest one. Fix: use trl.SFTTrainer or its equivalent.
(2) Chat template mismatch between training and inference. Fix: use tokenizer.apply_chat_template consistently in both.
(3) Trained for too few epochs or too many. The model either hasn’t learned the task or has overfit. Watch validation loss; the right answer is usually the epoch where val loss is at its minimum.
(4) The dataset has leakage from the eval set. Common when scraping from the internet — your eval prompts are also somewhere in the SFT data. The model “learns” them and inflates eval scores. Fix: dedupe the eval set against the training set.
(5) Catastrophic forgetting on general benchmarks. Already covered. Fix: lower LR, fewer epochs, or LoRA instead of full FT.
(6) Weird response cutoffs at inference. Usually a tokenizer EOS bug — the model isn’t learning to emit EOS, or the inference code isn’t stopping on it. Verify the EOS is the same in training and inference.
(7) The model produces the same response style for every prompt. Sign of either too few epochs (model hasn’t yet learned the patterns) or too small a dataset (model is overfitting to a narrow style). Increase data diversity.
(8) The model is too safe or too unsafe. Refusal calibration is hard. You may need to add more refusal examples (for “too unsafe”) or fewer (for “too safe”). This is an active part of post-training; we’ll cover it more in Chapter 17.
(9) Unexpected language drift. The model starts responding in the wrong language. Usually a sign of an unbalanced multi-language SFT mix. Fix the data balance.
(10) The model loses its system prompt. It follows the system prompt for the first turn or two and then drifts. Sign that the SFT data didn’t include enough multi-turn examples with system prompts. Add them.
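Bug (4) can be caught mechanically before training. A minimal exact-match dedupe sketch — real pipelines add fuzzy and n-gram matching on top of this:

```python
def normalize(text):
    """Cheap normalization: lowercase, collapse whitespace."""
    return " ".join(text.lower().split())

def find_eval_leakage(train_prompts, eval_prompts):
    """Return eval prompts whose normalized text also appears in training data."""
    train_set = {normalize(p) for p in train_prompts}
    return [p for p in eval_prompts if normalize(p) in train_set]

train = ["How do I sort a list in Python?", "Explain TCP handshakes."]
evals = ["how do I  sort a list in python?", "What is a monad?"]

leaked = find_eval_leakage(train, evals)
assert leaked == ["how do I  sort a list in python?"]  # caught despite casing/spacing
```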
16.11 The mental model
Eight points to take into Chapter 17:
- A base model is not an assistant. SFT teaches it the role.
- The SFT objective is standard cross-entropy with the prompt masked out. Loss masking is non-negotiable.
- Quality > quantity. A few thousand high-quality examples beat tens of thousands of mediocre ones.
- The chat template is part of the model. Train and serve with the same template, byte-for-byte.
- Synthetic data dominates the modern dataset stack. Magpie-style generation is the cheap-and-clean default.
- Catastrophic forgetting is real but avoidable with low LR, few epochs, LoRA, and multi-task data mixing.
- Hyperparameters are mostly defaults. Don’t overtune. LR ~`1e-5` for full FT, ~`1e-4` for LoRA.
- Loss masking is the #1 bug. Always use a framework that handles it, or test for it explicitly.
In Chapter 17 we add the alignment layer on top of SFT: RLHF, DPO, KTO, Constitutional AI, and the techniques that take a “behaves like an assistant” model and make it “behaves like the assistant we actually want.”
Read it yourself
- Zhou et al., LIMA: Less Is More for Alignment (2023). The “1000 examples is enough” paper.
- The Alpaca blog post and technical writeup (Stanford, 2023).
- The Dolly 2.0 dataset and writeup from Databricks (2023).
- Xu et al., Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing (2024).
- The HuggingFace `trl` library documentation, especially `SFTTrainer`.
- Wei et al., Finetuned Language Models Are Zero-Shot Learners (FLAN, 2022) — the foundational instruction-tuning paper.
Practice
- Why does masking the prompt out of the SFT loss matter? Construct a tiny example where unmasked loss gives a much smaller signal than masked loss.
- Train a tiny SFT on a 7B base using the Alpaca dataset and `trl.SFTTrainer`. Save the model and compare its responses to the base model on five prompts.
- The LIMA paper claims 1,000 carefully curated examples is enough. Read the paper. What counts as “carefully curated”? What’s the upper bound on what 1,000 examples can teach?
- Design an SFT dataset for a customer support assistant that needs to follow company policies. What would you include? How would you handle refusal cases? What’s the minimum size you’d start with?
- Why does LoRA reduce the risk of catastrophic forgetting? Trace what happens to the base model’s weights in a LoRA fine-tune vs a full fine-tune.
- A model trained on a dataset with ChatML format is being served with a Llama-3 chat template. What goes wrong, and how can you tell from the outputs?
- Stretch: Use Magpie-style self-prompting to generate 100 (instruction, response) pairs by prompting a Llama 3 instruct model with just the start-of-turn token. Inspect the quality. Train a small LoRA on them. Does it work?
Concept check
- 1. During SFT, loss is typically masked on the prompt and computed only on the response tokens. Why?
- 2. What is catastrophic forgetting in the context of SFT, and what is the primary mitigation?
- 3. Why does dataset quality matter more than dataset quantity for SFT?
- 4. A model trained with SFT using chat template A is deployed behind a front-end that applies chat template B. What failure mode should you expect?