Part II · Training, Fine-Tuning, Alignment
Chapter 18 Intermediate ~18 min read

Distillation, pruning, and training-time compression

"*"You have a 70B model and a 10ms latency budget. What do you do?"* — every senior ML systems interview"

In Chapters 11–17 we built models. In this chapter we shrink them. The setting: you have a model that’s good but too big or too slow for your serving budget, and you want a smaller model that captures most of the quality. The standard answers are knowledge distillation (train a small model to mimic a big one), pruning (remove weights that don’t matter), and a few related techniques. Quantization is a fourth way to compress, but it’s a serving-time technique and we cover it separately in Chapter 26.

This chapter is shorter than the surrounding ones because the algorithmic ideas are simple — the work is in the practice. By the end you will be able to:

  • Explain why distillation works and what “dark knowledge” means.
  • Implement a basic distillation training loop.
  • Place pruning and the lottery ticket hypothesis in context, and explain why pruning has not displaced quantization in practice.
  • Pick the right compression strategy for a given latency budget.

Outline:

  1. The compression frontier: when to compress and how much.
  2. Knowledge distillation — the framework.
  3. Hard targets vs soft targets, and the role of temperature.
  4. Why distillation works — the dark knowledge story.
  5. Distillation in production — frontier-to-small workflow.
  6. Pruning — structured vs unstructured.
  7. The lottery ticket hypothesis.
  8. Why pruning has not displaced quantization.
  9. The compression frontier: combining techniques.
  10. Picking a compression strategy.

18.1 The compression frontier

Compression is rarely the first thing you reach for. The pre-compression questions:

  • Can I use a smaller model instead? If a 7B already exists that’s good enough, use it. Distillation and pruning are for the case where no smaller model exists at the quality you need.
  • Can I quantize? For inference, quantization (Chapter 26) is the cheapest, simplest, and most effective compression technique. INT8 or INT4 cuts memory and increases throughput with small quality loss. You almost always quantize before you consider distillation or pruning.
  • Can I cache? Prefix caching (Chapter 29) often gives bigger latency wins than compression.
  • Can I use a different architecture? Sometimes the right answer is “use a different model with a different architecture that’s faster at the same parameter count.”

Once you’ve exhausted these, distillation and pruning come into play. Specifically:

  • Distillation is for when you want a smaller architecture (different parameter count) to match the behavior of a larger one. The output is a new model, smaller and trained from scratch (or from a pretrained smaller base) to mimic the larger one.
  • Pruning is for when you want to keep the same architecture but remove weights that don’t matter. The output is a sparse version of the original model.

The ideal use of distillation: you have a frontier 70B+ model that produces gold-standard outputs but is too expensive to serve, and you want a 7B–13B model that produces 90% of the quality at 10× lower cost.

18.2 Knowledge distillation — the framework

Knowledge distillation (Hinton et al., 2015) is the technique of training a smaller “student” model to mimic the output of a larger “teacher” model. The teacher does inference on a large pool of inputs and produces predictions; the student is trained to match those predictions.

The framework is simple:

  1. Pick a teacher model — a strong model whose behavior you want to capture.
  2. Pick a student architecture — a smaller model that you want to train.
  3. Generate training data — run the teacher on a corpus of inputs. The “labels” are the teacher’s outputs (not the original ground-truth labels).
  4. Train the student to match the teacher’s outputs on the training data.

That’s the core of it. The student learns to imitate the teacher. After training, you serve the student instead of the teacher; the student is much smaller and faster.

The interesting question is what “match the teacher’s outputs” means. There are two flavors.

[Figure] Knowledge distillation: a prompt corpus flows into a frozen 70B frontier teacher; the teacher's outputs become the training targets for the 7B student. The original ground-truth labels are NOT used.
Distillation bypasses ground-truth labels entirely — the teacher's predictions become the student's training targets, transferring the teacher's behavior in compressed form.

18.3 Hard targets vs soft targets

Hard targets

The simplest approach: take the teacher’s most-likely output (the argmax token) and use that as the training label for the student. This is just standard SFT, with the teacher’s outputs in place of human-written ones.

This works, and it’s how most modern “frontier-to-small” distillation pipelines start. You generate millions of (prompt, teacher_response) pairs and SFT the student on them. Alpaca was an early example: take text-davinci-003 outputs, treat them as labels, fine-tune Llama on them. The student learns to mimic the teacher’s behavior.
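A minimal sketch of that data-generation step. `teacher_generate` is a hypothetical stand-in for whatever produces teacher responses — a local `generate()` loop or an API call:

```python
import json

def build_distillation_dataset(prompts, teacher_generate):
    """Hard-target distillation data: the teacher's responses become the labels.

    `teacher_generate` is any callable prompt -> response (hypothetical here);
    in practice it wraps a large model's generation or a rate-limited API.
    """
    records = []
    for prompt in prompts:
        response = teacher_generate(prompt)  # teacher's sampled/argmax output
        # Standard SFT format — the original ground-truth label never appears.
        records.append({"prompt": prompt, "response": response})
    return records

# Toy stand-in teacher, purely for illustration:
dataset = build_distillation_dataset(["What is 2+2?"], lambda p: "4")
print(json.dumps(dataset[0]))
```

The resulting records are fed to an ordinary SFT trainer; nothing about the training loop itself changes.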

But you’re throwing away information. The teacher has a full probability distribution over the vocabulary at each step — the student only sees the top-1 choice. The “almost as likely” alternatives, which contain real information about how the teacher thinks, are lost.

Soft targets

The richer approach: instead of taking the argmax, use the full probability distribution from the teacher as the training target. The student is trained to match the entire distribution, not just the top choice.

The loss is the KL divergence between the student’s predicted distribution and the teacher’s:

L = KL( p_teacher || p_student )

For each token position, the teacher emits a distribution p_teacher over the vocabulary, the student emits p_student, and we minimize the KL divergence between them. This is more informative than the cross-entropy on the top-1 choice because it conveys “and here’s how confident I was in the alternatives” — the teacher’s uncertainty itself is part of the training signal.

In practice, soft-target distillation requires you to save the teacher’s logits for every training token, which is enormous storage (vocab size × tokens × dtype). For a 100k vocab and 10M tokens of training data, that’s 100k × 10M × 4 bytes = 4 TB of stored logits. The fix is to keep only the top-K logits (e.g., top 100), renormalize, and treat the rest as zero. This loses a little information but shrinks the storage to ~4 GB.
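The storage arithmetic, spelled out (the top-K token indices add a comparable amount on top, which is still negligible next to the full-vocab case):

```python
vocab_size = 100_000        # tokens in the vocabulary
n_tokens   = 10_000_000     # training tokens with saved teacher logits
bytes_per  = 4              # one fp32 logit

# Full distribution: one logit per vocab entry per token position.
full_bytes = vocab_size * n_tokens * bytes_per
print(f"full logits: {full_bytes / 1e12:.0f} TB")      # 4 TB

# Truncated: keep only the 100 largest logits per position.
top_k = 100
topk_bytes = top_k * n_tokens * bytes_per
print(f"top-{top_k} logits: {topk_bytes / 1e9:.0f} GB")  # 4 GB
```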

The Hinton paper (2015) introduced the temperature trick to make soft targets more useful. The teacher’s softmax is computed at a higher temperature (T > 1) before being used as the target:

soft_label = softmax(teacher_logits / T)

A higher temperature flattens the distribution, making the small probabilities (the “I considered this but picked something else” ones) more visible.
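A quick numeric check of the flattening effect, in plain Python:

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits scaled by temperature T."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [10.0, 5.0, 2.0]
for T in (1, 4):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: " + ", ".join(f"{p:.3f}" for p in probs))
# At T=1 the top token takes ~99% of the mass and the alternatives are
# invisible; at T=4 it drops to ~70% and the runner-ups become visible.
```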

[Figure] The same teacher distribution over five tokens (tok A–E) at T = 1 and T = 4: at T = 1 the distribution is peaked and the dark knowledge in the low-probability alternatives is invisible; at T = 4 it flattens and the dark knowledge becomes visible.
Temperature scaling reveals the teacher's "dark knowledge" — at T=4, the student can see that tok B was the runner-up and tok C was plausible, information that is invisible in the hard-target argmax.
This is the **dark knowledge**: the information that's in the relative magnitudes of the unlikely classes, not in the top class. Without temperature scaling, the top class dominates and the dark knowledge is washed out.

Combining hard and soft

A common approach is to combine both: the student loss is a weighted sum of the hard-target cross-entropy and the soft-target KL divergence.

L = α × CE(student, hard_label) + (1 - α) × T² × KL(softmax(student / T), softmax(teacher / T))

The T² factor is needed to keep the gradient magnitudes from the soft-target term balanced as T changes (a detail from the Hinton paper). The α weight is typically 0.1 to 0.5.

This combined loss gives the student both the “pick the right answer” signal and the “and here’s how confident the teacher was” signal. It’s the standard recipe for serious distillation.
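A framework-agnostic NumPy sketch of the combined loss. The shapes, α, and T are illustrative; in PyTorch the same logic maps onto `F.cross_entropy` and `F.kl_div`:

```python
import numpy as np

def log_softmax(logits, T=1.0):
    """Numerically stable log-softmax at temperature T."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      T=4.0, alpha=0.3):
    """alpha * CE(hard) + (1 - alpha) * T^2 * KL(teacher || student).

    Shapes: logits are (batch, vocab); hard_labels is (batch,) of token ids.
    """
    # Hard-target cross-entropy against the argmax/ground-truth labels.
    log_p_student = log_softmax(student_logits)
    ce = -log_p_student[np.arange(len(hard_labels)), hard_labels].mean()

    # Soft-target KL divergence, both distributions at temperature T.
    log_p_t = log_softmax(teacher_logits, T)
    log_p_s = log_softmax(student_logits, T)
    p_t = np.exp(log_p_t)
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1).mean()

    # T^2 keeps the soft-target gradient magnitude stable as T changes.
    return alpha * ce + (1 - alpha) * T**2 * kl
```

When the student's logits exactly match the teacher's, the KL term vanishes and only the hard-target cross-entropy remains.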

18.4 Why distillation works — dark knowledge

The reason distillation works is more interesting than the algorithm. The teacher is a function from inputs to probability distributions. The student is being trained to approximate that function. With hard targets, you’re showing the student “here’s where the function lands”; with soft targets, you’re showing “here’s the full output, including all the information about which alternatives are nearby.”

The soft-target distribution carries information that hard targets don’t. Consider a simple example: the teacher classifies an image of a “Dalmatian” and outputs:

Dalmatian: 0.85
Pointer:   0.10
Beagle:    0.04
Cat:       0.001
Truck:     0.0001

The hard target is just “Dalmatian.” But the soft distribution tells you a lot more: “this is a dog, specifically a hunting-dog-shaped dog, and definitely not a cat or a truck.” A student trained to match the full distribution learns these similarity structures. The structure is not explicitly encoded anywhere; it’s an emergent property of the teacher’s learned function. Hinton called it “dark knowledge” — it’s there but you can’t easily see it.

For language models, the dark knowledge is even richer. The teacher’s distribution over the next token encodes its understanding of grammar, semantics, world knowledge, and stylistic preferences. A student trained to match this distribution inherits a compressed version of the teacher’s knowledge in a way that pure SFT on hard labels doesn’t.

This is why distillation can produce a small student that’s better than what you’d get by training the same small architecture on the original task data. The teacher acts as a richer label source.

18.5 Distillation in production — frontier-to-small workflow

The current frontier-to-small distillation pipeline used by major labs and well-funded teams looks roughly like:

  1. Pick a teacher. A frontier-class model that you have access to (your own internal model, or a model you can call via API at scale).
  2. Generate a large pool of inputs. Tens of millions of diverse prompts covering the tasks you care about. The diversity matters more than the volume.
  3. Run the teacher on every input. Save the teacher’s responses (and optionally, the top-K logits at every step).
  4. SFT the student on the (prompt, teacher_response) pairs. This is the bulk of the distillation. The student starts from a pretrained smaller base (not from scratch) and is fine-tuned with the teacher’s outputs as the labels.
  5. Optionally distill with soft targets. If you saved the teacher’s logits, do a second phase of training with the KL-divergence loss against the soft targets. This typically gives 1–3 points of additional benchmark quality.
  6. Optionally do DPO (Chapter 17) with preference pairs generated by the teacher itself: have the teacher rank pairs of student responses, then train the student to prefer the teacher’s preferences.

This workflow is how you end up with models like Phi-3-mini (3.8B parameters, distilled from a much bigger teacher), Mistral 7B (allegedly distilled from a larger Mistral, though they don’t say), and many specialized small models.

The cost: distillation is typically 10–100× cheaper than pretraining the student from scratch, because you’re starting from a good base and only fine-tuning. You can distill a 7B model from a 70B teacher for a few thousand dollars of compute.

The legal catch: the teacher’s outputs are subject to its license. If the teacher is GPT-4 and the student is open-source, the GPT-4 terms forbid using its outputs to train competing models. The OpenAI clause is famously restrictive. Most labs avoid this by using their own teacher models. When you see an open-source distilled model, you should look at which teacher was used and whether the legal status is clean.

18.6 Pruning — structured vs unstructured

Pruning removes weights from a trained model to make it sparser (and therefore smaller and faster). The idea is that most of the weights in a trained network are doing little useful work, and you can remove them with minimal quality loss.

There are two flavors:

Unstructured pruning

Set individual weights to zero. The simplest version is magnitude pruning: rank all weights by their absolute value, set the smallest X% to zero, fine-tune to recover the lost performance, repeat.

The result is a sparse weight matrix where most entries are zero but a few are nonzero in arbitrary positions. This compresses the storage (you can store the indices and values of the nonzero entries) but does not speed up matmul on standard GPU hardware. GPUs are designed for dense matrix multiplication; sparse matmul is slower in practice unless you have specialized sparse-matrix support (which most GPUs don’t, and which requires very high sparsity ~95%+ to actually win).

So unstructured pruning is good for storage, bad for inference speed on commodity hardware. It’s mostly useful in research and in deployment scenarios with custom sparse-matrix accelerators.
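A minimal NumPy sketch of one round of magnitude pruning; the fine-tune-and-repeat loop around it is elided:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-|w| fraction of entries (unstructured pruning).

    Returns the pruned matrix and the binary mask; in a real pipeline you
    would fine-tune with the mask held fixed, then prune further and repeat.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W_pruned, mask = magnitude_prune(W, sparsity=0.5)
print(f"fraction zeroed: {1 - mask.mean():.2f}")
```

Note that `W_pruned` is still a dense array full of zeros — which is exactly the problem: a standard GPU matmul kernel does the same work on it as on the original.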

Structured pruning

Remove entire structures — full attention heads, full FFN columns, full layers, full channels. The pruned model is smaller and denser, so it runs at full GPU efficiency.

Structured pruning is more aggressive in what it gives up — you can’t remove fractional heads, so the granularity is coarse — but the speedup is real. Modern structured pruning methods include:

  • Head pruning: remove the least-important attention heads. Typically you can remove 10–30% of heads with small quality loss.
  • Layer pruning: remove entire transformer blocks. Some models have surprisingly redundant middle layers; removing 1-2 of 32 can have minimal effect.
  • Width pruning: shrink the hidden dimensions of the FFN or attention. More aggressive but harder to do without retraining.
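A toy NumPy sketch of head pruning as a structural edit: the head's rows and columns are physically removed, leaving smaller dense matrices. The weight layout here (rows/columns grouped contiguously by head) is an assumption for illustration; a real implementation must follow the model's actual head grouping:

```python
import numpy as np

def drop_head(w_qkv, w_out, head, n_heads):
    """Structured head pruning: physically remove one attention head.

    w_qkv: (n_heads * d_head, d_model) — rows grouped by head (assumed layout)
    w_out: (d_model, n_heads * d_head) — columns grouped by head
    The result is a smaller *dense* pair of matrices, which is why structured
    pruning actually speeds up inference on GPUs, unlike unstructured pruning.
    """
    d_head = w_qkv.shape[0] // n_heads
    keep = [i for i in range(w_qkv.shape[0])
            if not (head * d_head <= i < (head + 1) * d_head)]
    return w_qkv[keep, :], w_out[:, keep]

# Toy sizes: 16 heads of dim 64, model dim 512.
w_qkv = np.ones((16 * 64, 512))
w_out = np.ones((512, 16 * 64))
q2, o2 = drop_head(w_qkv, w_out, head=3, n_heads=16)
print(q2.shape, o2.shape)   # one head's 64 rows/columns removed
```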

The Sheared LLaMA paper (Xia et al., 2023) showed that you can structure-prune a Llama 7B down to 1.3B and recover most of the original quality with continued pretraining on a small amount of additional data. This is one of the most successful structured pruning results to date.

18.7 The lottery ticket hypothesis

A famous and deeply weird finding from Frankle & Carbin (2019): a large, randomly initialized network contains a small subnetwork that, if trained in isolation from that same initialization, would reach the accuracy of the full trained network. This subnetwork is the “winning ticket,” and the rest of the weights are “lottery losers.”

The procedure:

  1. Train a network normally.
  2. Identify the smallest weights and prune them.
  3. Reset the remaining weights to their original initialization values (not the trained values).
  4. Train again with the same data.
  5. The pruned network reaches the same accuracy as the original.

This is surprising because it suggests that most of the weights in a trained network were never doing useful work — they were initialized to bad values and never recovered. Only a small “winning” subnetwork was on the right initialization trajectory and learned anything useful.

The lottery ticket hypothesis is a fascinating theoretical result and has launched a small research industry. But its practical impact has been limited. The reason: finding the winning ticket requires training the full network first, which defeats the purpose. You can’t avoid the training cost by pruning, because you need the training to identify what to prune.
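The prune-and-reset core of the procedure (steps 2–3) can be sketched as follows; the two full training runs, steps 1 and 4, are the expensive part and are stubbed out here:

```python
import numpy as np

def winning_ticket(init_weights, trained_weights, sparsity):
    """Lottery-ticket prune-and-reset: mask by trained-weight magnitude,
    then reset the surviving weights to their ORIGINAL initialization."""
    flat = np.abs(trained_weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(trained_weights) > threshold   # step 2: prune the smallest
    return init_weights * mask, mask             # step 3: reset to init, not trained

rng = np.random.default_rng(1)
w_init = rng.normal(size=(128, 128))
# Stand-in for step 1 (a full training run) — just perturbed init here.
w_trained = w_init + rng.normal(scale=0.5, size=(128, 128))
ticket, mask = winning_ticket(w_init, w_trained, sparsity=0.8)
# Step 4 would now retrain `ticket` with `mask` held fixed. Note that step 1
# already cost a full training run — which is why no speedup materializes.
```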

18.8 Why pruning has not displaced quantization

Despite decades of pruning research, quantization has won the production compression race. The reasons:

  • Quantization works on standard hardware. INT8 and FP8 matmul are first-class on H100/H200; INT4 has good kernel support too. Sparse matmul, by contrast, requires specialized support that most hardware doesn’t have.
  • Quantization is simpler operationally. It’s a single transformation applied to a trained model, with no retraining needed. Pruning typically requires fine-tuning to recover lost quality.
  • Quantization is more predictable. A given quantization scheme produces a known quality cost; pruning’s cost varies wildly by model, task, and pruning strategy.
  • Quantization gives larger speedups. INT8 doubles throughput vs bf16 on H100; FP8 doubles it again. Pruning at typical levels (50% sparsity) gives no speedup at all on dense kernels.

So in practice, the production workflow is:

  1. Train a large model.
  2. Quantize it (INT8, FP8, or AWQ INT4, depending on hardware and quality budget).
  3. Serve.

Pruning rarely enters the picture, except in specialized edge deployment scenarios where storage matters more than speed, or in research settings.

The “you should know it exists” framing applies to pruning. It’s part of the senior vocabulary, it’s in the literature, and you should be able to explain why it hasn’t taken over. But you should not expect to use it in production.

18.9 Combining techniques

In practice, the techniques compose:

  • Distillation + quantization. Distill a 7B model from a 70B teacher, then quantize the 7B to INT4. Combined cost reduction: ~80×. This is one of the most common production paths.
  • Distillation + pruning. Distill a small model and then prune it further. Marginal additional savings.
  • Quantization-aware training. Train the model with a “fake quantization” forward pass that simulates the quantized arithmetic, so the model learns to be robust to quantization. Better quality than post-training quantization at the same bit width.
  • LoRA + quantization (QLoRA). We covered this in Chapter 15. Fine-tune a quantized base.

The skill is knowing the right combination for the budget and quality target.

18.10 Picking a compression strategy

The decision tree:

graph TD
  Start[Need a smaller / faster model] --> Q1{How much smaller?}
  Q1 -->|2× same arch| Q2[Quantize FP8/INT8]
  Q1 -->|4× same arch| Q3[Quantize INT4 AWQ]
  Q1 -->|10× new arch| Q4{Budget for distillation?}
  Q4 -->|small| Q5[Distill from pretrained smaller base]
  Q4 -->|large| Q6[Full distillation pipeline<br/>+ DPO on teacher preferences]
  Q2 --> Done[Serve]
  Q3 --> Done
  Q5 --> Done
  Q6 --> Done

Quantization is almost always the right first answer; distillation is reserved for cases where a different (smaller) architecture is needed.

Q: How much smaller does the model need to be?

  • 2× smaller, same architecture → quantize to FP8 or INT8.
  • 4× smaller, same architecture → quantize to INT4 (AWQ or GPTQ).
  • 10× smaller, willing to change architecture → distill from a smaller pretrained base.
  • 100× smaller, very specialized task → distill into a tiny task-specific model, then quantize.

Q: How much quality am I willing to give up?

  • Essentially none → FP8 or INT8.
  • A small amount (1–3 percentage points on benchmarks) → INT4 with AWQ.
  • A moderate amount (5–10 percentage points) → distill into a 7B–13B from a 70B+.
  • Whatever it takes → distill aggressively into a small model with task-specific data, then quantize.

Q: How much budget for the compression itself?

  • Almost none, fast turnaround → post-training quantization (an afternoon’s work).
  • Some budget → post-training quantization + a quick fine-tune on calibration data.
  • Significant budget → distillation pipeline (weeks, $10k+).
  • Major investment → from-scratch distilled model with custom data and multi-stage training.

The default for most production teams: quantize aggressively (INT4) and use the existing model as-is. Distillation is for cases where quantization alone isn’t enough or where you specifically need a smaller architecture for latency or memory reasons.

18.11 The mental model

Eight points to take into Chapter 19:

  1. Compression is rarely the first lever. Quantization, smaller existing models, prefix caching all come before distillation.
  2. Distillation trains a smaller student to mimic a larger teacher’s outputs.
  3. Hard targets (the teacher’s argmax) are SFT-with-the-teacher’s-answers. Soft targets (the teacher’s full distribution at high temperature) carry “dark knowledge” — the relative likelihoods of unlikely answers.
  4. Distillation works in practice. Frontier-to-small distillation is the common path to small high-quality models.
  5. Unstructured pruning zeros individual weights. Saves storage, doesn’t speed up dense matmul.
  6. Structured pruning removes whole heads, layers, or channels. Speeds up inference but is harder to apply without retraining.
  7. The lottery ticket hypothesis is theoretically interesting but practically irrelevant for inference-cost reduction.
  8. Quantization has won the production compression race. Pruning lives in research and specialized edge deployment.

In Chapter 19 we look at the data side of the post-training story: synthetic data, the new dominant paradigm.


Read it yourself

  • Hinton, Vinyals & Dean, Distilling the Knowledge in a Neural Network (2015). The original distillation paper.
  • Frankle & Carbin, The Lottery Ticket Hypothesis (2019). Read for the idea, not for production guidance.
  • Xia et al., Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning (2023).
  • The Phi technical report (any of Phi-1, Phi-2, Phi-3 — they all describe their distillation/curation approach).
  • Dettmers, 8-bit Optimizers via Block-wise Quantization (2022) — for the bridge between training compression and quantization.

Practice

  1. Implement a hard-target distillation training loop: take a teacher model, generate (prompt, response) pairs, SFT a smaller student on them. Use a 70B teacher and a 7B student if you can; if not, scale down.
  2. Implement a soft-target distillation loss in PyTorch with a temperature parameter. Verify the gradient through softmax(logits/T) is correct.
  3. Why does the temperature in soft-target distillation flatten the distribution? Compute softmax([10, 5, 2]) at T=1, T=2, T=10. What’s the qualitative difference?
  4. Why doesn’t unstructured pruning speed up matmul on standard GPUs? What hardware would make it useful?
  5. Pick a small open model (e.g., Llama-3.2-1B) and remove 1 of its 16 attention heads at random. Measure the quality drop on a small eval. Now do the same for the least important head (you decide how to score importance). Compare.
  6. Estimate the cost of distilling a 7B student from a 70B teacher on 10M (prompt, response) pairs. Include teacher inference cost, student training cost, and any preference-pair generation.
  7. Stretch: Read the Sheared LLaMA paper. Reproduce the structured pruning + continued pretraining recipe on a small model and verify the quality recovers.

Concept check

  1. What is the 'dark knowledge' that soft targets from a teacher model provide to a student during distillation?
  2. Why is temperature T > 1 used when computing soft targets from a teacher model during distillation?
  3. Unstructured pruning removes individual weights based on magnitude. Why has it not displaced quantization as the dominant inference compression technique?
  4. The Lottery Ticket Hypothesis claims a sparse subnetwork exists that can train to the same accuracy as the full network. Why has this not led to a practical training speedup?