Night-before cheat sheet

Part II: Training, Fine-Tuning, Alignment

10 chapters (11–20). 40 key facts to review.

Key facts (from quizzes)

Ch 11: Pretraining at scale: data, compute, curriculum

  • The Chinchilla scaling law (Hoffmann et al., 2022) found that prior large models were undertrained. Its key prescription is
    For a fixed compute budget, optimal loss is achieved by scaling parameters and training tokens roughly equally — not by maximizing model size alone
    Pre-Chinchilla models like GPT-3 were trained on far fewer tokens than Chinchilla-optimal. The law shows that doubling compute should split roughly evenly between more parameters and more data, not go entirely to bigger models.
  • Data deduplication is called the most important data preprocessing step. The main reason is
    Near-duplicate documents cause the model to memorize specific text spans and degrade generalization and benchmark integrity
    When the same text appears thousands of times, the model strongly memorizes it verbatim, hurting generalization and creating extraction risks. Deduplication also prevents benchmark contamination if test-set text appears in the crawl.
  • A frontier model is trained on 15 trillion tokens with a vocabulary size of 100k. Approximately how many bytes does storing the raw token ID sequence require?
    60 TB (4 bytes per token, since a 100k vocab exceeds uint16's max of 65,535)
    With a 100k vocabulary, token IDs don't fit in 1 byte (max 255) or even 2 bytes (uint16, max 65,535); the minimum is 17 bits per ID. In practice most pipelines store token IDs as 4-byte int32/uint32, giving 15×10^12 × 4 = 60 TB. A bit-packed 17-bit encoding would cut this to roughly 32 TB, but few storage pipelines bother.
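A quick arithmetic check of the required ID width and the resulting storage totals (a sketch; 1 TB taken as 10^12 bytes):

```python
import math

tokens = 15e12                         # 15 trillion tokens
vocab = 100_000
bits = math.ceil(math.log2(vocab))     # 17 — one bit too many for uint16 (max 65,535)

uint32_tb = tokens * 4 / 1e12          # 60.0 TB with practical 4-byte IDs
packed_tb = tokens * bits / 8 / 1e12   # 31.875 TB if bit-packed at 17 bits per token
```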
  • Why do most companies choose not to pretrain their own frontier LLM even when they have the ML talent?
    The compute and data acquisition cost for a frontier model is hundreds of millions of dollars, and the resulting model is not meaningfully better than open weights for most applications
    A frontier pretraining run costs $50M–$500M+ in compute alone, requires custom distributed infrastructure, and the resulting model rarely outperforms open-weights alternatives for most downstream tasks. Fine-tuning open weights is orders of magnitude cheaper and often produces better task-specific results.

Ch 12: Distributed training: DDP, FSDP, ZeRO, tensor/pipeline/sequence parallel

  • ZeRO Stage 3 shards model parameters across all data-parallel ranks in addition to optimizer state and gradients. The main cost of this compared to Stage 1 or 2 is
    Each forward and backward pass requires an all-gather of parameters before use and discard after, adding significant communication overhead
    With Stage 3, each GPU only holds a shard of the parameters. Before computing any layer, all ranks must all-gather that layer's parameters, then discard them after use. This all-gather happens once per layer per forward pass and again per backward pass — significant communication but the only way to fit models whose parameters alone exceed single-GPU memory.
  • Tensor parallelism splits individual matrix multiplications across GPUs. The communication primitive required between the split operations is
    All-reduce
    In Megatron-style column/row tensor parallelism, each GPU computes a partial result of the matrix multiply. An all-reduce sums these partial results across all tensor-parallel ranks before the output is passed to the next operation. (Reduce-scatter + all-gather is equivalent but used in sequence parallel variants.)
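The partial-result-plus-sum structure can be checked in plain NumPy by simulating two tensor-parallel "ranks" (a sketch; real implementations run the sum as an NCCL all-reduce across GPUs):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8))   # activations
W = rng.standard_normal((8, 4))   # weight matrix, split row-wise across 2 "ranks"

# Each rank holds half of W's rows and the matching half of X's columns,
# computes a partial product, and the all-reduce sums the partials.
partial_rank0 = X[:, :4] @ W[:4, :]
partial_rank1 = X[:, 4:] @ W[4:, :]

assert np.allclose(partial_rank0 + partial_rank1, X @ W)
```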
  • Pipeline parallelism assigns contiguous groups of transformer layers to different GPUs. The main efficiency problem it introduces is
    Pipeline bubbles — GPUs sit idle at the start and end of each micro-batch as the pipeline fills and drains
    In a simple pipeline with F stages, up to F-1 stages are idle (bubbled) while the first micro-batch moves through. Micro-batching and interleaved schedules reduce but don't eliminate this overhead. The bubble fraction is (F-1)/(m + F-1) for m micro-batches.
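The bubble formula is easy to sanity-check — more micro-batches shrink the idle fraction for a fixed stage count:

```python
def bubble_fraction(stages, micro_batches):
    # idle fraction of a simple (GPipe-style) pipeline schedule
    return (stages - 1) / (micro_batches + stages - 1)

assert bubble_fraction(4, 1) == 0.75     # one micro-batch: mostly bubble
assert bubble_fraction(4, 16) == 3 / 19  # 16 micro-batches: ~16% bubble
```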
  • For a 70B AdamW model in bf16 weights with fp32 optimizer state, what is the minimum per-GPU memory needed to hold just the optimizer state when training with ZeRO Stage 1 across 64 GPUs?
    560 GB / 64 = 8.75 GB
    AdamW stores two fp32 (4-byte) moments per parameter: 2 × 4 × 70×10^9 = 560 GB of optimizer state total. ZeRO Stage 1 shards this evenly across all ranks, so each of the 64 GPUs holds 560/64 ≈ 8.75 GB — still significant, but far less than the 560 GB total.
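The arithmetic, spelled out (counting only the two Adam moments, per the question):

```python
params = 70e9
moment_bytes = 2 * 4                     # two fp32 Adam moments per parameter
total_gb = params * moment_bytes / 1e9   # 560 GB of optimizer state
per_gpu_gb = total_gb / 64               # 8.75 GB per rank with ZeRO-1 over 64 GPUs
```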

Ch 13: Mixed precision training: fp16, bf16, fp8, loss scaling

  • bf16 is safer than fp16 for training despite having fewer mantissa bits. The reason is
    bf16 shares the same 8-bit exponent as fp32, giving the same dynamic range and preventing gradient underflow
    fp16 has only a 5-bit exponent, limiting its dynamic range. Gradients that are small (common late in training) underflow to zero. bf16's 8-bit exponent matches fp32's dynamic range, so very small or very large gradients are representable without loss scaling.
  • Loss scaling is required for fp16 training because
    Small gradient values fall below fp16's minimum representable non-zero value and flush to zero
    fp16's smallest normal number is ~6×10^-5. Late-stage gradients are often much smaller. Loss scaling multiplies the loss by a large constant before backward, scaling all gradients up so they don't underflow; the optimizer then divides them back before the parameter update.
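The underflow, and how scaling rescues it, can be seen directly with NumPy's float16 (stock NumPy has no bfloat16, so fp16 stands in):

```python
import numpy as np

grad = 1e-8                           # a typical tiny late-training gradient
assert np.float16(grad) == 0.0        # flushes to zero in fp16

scale = 2.0 ** 16                     # loss scale; a power of two is exact to undo
scaled = np.float16(grad * scale)     # ~6.55e-4, comfortably representable
assert scaled > 0.0
unscaled = np.float32(scaled) / scale # optimizer divides back in higher precision
```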
  • A training run uses 'master weights' — an fp32 copy of parameters kept alongside bf16 weights. What is the purpose of the fp32 copy?
    To accumulate small gradient updates accurately — updates that are tiny relative to the weight value would be lost in bf16 rounding
    When a gradient update is much smaller than the current weight magnitude, adding them in bf16 can round the update away entirely. The fp32 master copy preserves these small deltas accurately. After each optimizer step the fp32 master is downcast to bf16 for use in the forward/backward passes.
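The rounding-away effect is reproducible in NumPy; fp16 stands in for bf16 here (bf16's spacing near 1.0 is even coarser: 7 mantissa bits vs fp16's 10):

```python
import numpy as np

w16 = np.float16(1.0)
update = np.float16(1e-4)      # update ~10x smaller than fp16's spacing near 1.0
assert w16 + update == w16     # the update rounds away entirely in low precision

w32 = np.float32(1.0)          # fp32 master copy
w32 += np.float32(1e-4)
assert w32 > 1.0               # the master accumulates the small delta
```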
  • fp8 training on H100 offers roughly 2× throughput vs bf16 but requires calibration. The key challenge calibration solves is
    fp8's tiny dynamic range means tensors must be scaled per-layer so their values map into fp8's representable range without overflow or catastrophic rounding
    fp8 (e4m3) can represent values only in a narrow range. Without per-tensor or per-channel scale factors, most activations and gradients will either overflow to NaN or underflow to zero. Calibration measures the actual value distributions and chooses scale factors that keep values in fp8's representable range.

Ch 14: Tokenizer training: BPE and SentencePiece from scratch

  • What does BPE greedily optimize during each merge step?
    The pair of symbols with the highest frequency across the corpus
    BPE always merges the most frequent adjacent pair. This greedy frequency-first approach is the core algorithm and has no notion of morphology or entropy.
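The merge step fits in a few lines; this toy sketch counts adjacent pairs over a word list and applies the most frequent merge:

```python
from collections import Counter

def best_pair(words):
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    return pairs.most_common(1)[0][0]   # highest-frequency adjacent pair

def merge(w, pair):
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
            out.append(w[i] + w[i + 1]); i += 2
        else:
            out.append(w[i]); i += 1
    return out

words = [list("lower"), list("lowest"), list("low")]
pair = best_pair(words)                  # ('l', 'o') — appears in all three words
words = [merge(w, pair) for w in words]  # 'lo' is now a single symbol everywhere
```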
  • Why does a larger vocab size generally reduce sequence length but increase training difficulty?
    Larger vocabs merge more pairs so each token is rarer and harder for the model to learn well
    Each additional merge reduces average sequence length, but the new merged token is rarer in the corpus. Rarer tokens have fewer learning examples, making each token harder to embed well.
  • What is the key reason you cannot retrain a tokenizer after a model is already pretrained?
    The model weights encode the meaning of specific token IDs and changing the vocabulary breaks those learned associations
    Every token embedding in the pretrained model is keyed to a specific integer ID. Retraining the tokenizer changes which string maps to which ID, invalidating all learned embeddings.
  • SentencePiece differs from standard BPE primarily because it treats whitespace as a regular character and does pre-tokenization differently. What concrete problem does this solve for multilingual models?
    It handles languages without whitespace word boundaries uniformly by treating the space as an ordinary merging target
    Languages like Chinese and Japanese have no word-boundary spaces. SentencePiece processes the raw character stream with space as a symbol, so the same algorithm applies across all writing systems without a language-specific pre-tokenizer.

Ch 15: Fine-tuning: full, LoRA, QLoRA, adapters, the PEFT family

  • In LoRA, what does the rank r control?
    The dimension of the low-rank matrices A and B whose product approximates the weight delta
    LoRA decomposes the weight update dW into A (d x r) times B (r x k). Rank r is the bottleneck dimension; smaller r means fewer trainable parameters and stronger regularization.
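The parameter savings for a single projection matrix, using illustrative shapes (d = k = 4096, r = 8):

```python
d, k, r = 4096, 4096, 8
full = d * k              # trainable params when fully fine-tuning this matrix
lora = d * r + r * k      # A (d x r) plus B (r x k)
ratio = lora / full       # ~0.39% of the full parameter count
```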
  • What key memory saving does QLoRA achieve over standard LoRA?
    It stores the frozen base model in 4-bit NF4 quantization while training LoRA adapters in bf16
    QLoRA keeps the base model in NF4 (roughly 4 bits/parameter) on GPU, which cuts the base model's memory footprint by ~4x. Adapter parameters and optimizer state remain in higher precision.
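Back-of-envelope footprint for a 7B base model (a sketch that ignores NF4's per-block scale overhead, which adds a few percent):

```python
params = 7e9
bf16_gb = params * 2 / 1e9    # 14 GB base model in bf16
nf4_gb = params * 0.5 / 1e9   # ~3.5 GB in 4-bit NF4
```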
  • Why is merging a LoRA adapter back into the base weights preferred over keeping them separate at serving time?
    Merging removes the inference overhead of the adapter forward pass and produces a single standard model weight file
    Separate adapters add an extra matrix multiply at every forward pass. Merging (W_merged = W + AB) bakes the delta directly into the weight, eliminating adapter overhead with no quality change.
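The no-quality-change claim is just distributivity, checkable numerically with toy shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 16, 4
W = rng.standard_normal((d, k))
A = rng.standard_normal((d, r))
B = rng.standard_normal((r, k))
x = rng.standard_normal((1, d))

adapter_out = x @ W + (x @ A) @ B   # separate adapter: extra matmuls every pass
merged_out = x @ (W + A @ B)        # merged: one standard matmul
assert np.allclose(adapter_out, merged_out)
```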
  • A team has a 7B base model and 2000 high-quality (instruction, response) pairs for a narrow classification task. Full fine-tuning produces marginal improvement over the base model. What is the most likely root cause?
    The learning rate was too high, causing the model to memorize the 2000 examples and catastrophically forget general capabilities
    2000 examples is tiny relative to the number of parameters. A high learning rate on such a small dataset commonly drives the model to overfit the training data while erasing pretrained knowledge, a classic catastrophic forgetting failure.

Ch 16: SFT and instruction tuning: turning a base model into an assistant

  • During SFT, loss is typically masked on the prompt and computed only on the response tokens. Why?
    The model should learn to generate the response, not to predict the instruction it was already given
    The goal of SFT is to teach the model to produce good responses. Backpropagating through prompt tokens would train the model to predict user inputs, which is not the desired behavior and dilutes the training signal.
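A minimal NumPy sketch of the masked loss (shapes and the mask layout are illustrative): prompt positions are excluded, so their logits cannot affect the loss at all:

```python
import numpy as np

def sft_loss(logits, targets, response_mask):
    # logits: (T, V); targets: (T,); response_mask: (T,) True on response tokens
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))  # log-softmax
    nll = -logp[np.arange(len(targets)), targets]
    return nll[response_mask].mean()   # average only over response tokens

rng = np.random.default_rng(0)
logits = rng.standard_normal((6, 10))
targets = np.array([1, 2, 3, 4, 5, 6])
mask = np.array([False, False, False, True, True, True])  # first 3 = prompt
loss = sft_loss(logits, targets, mask)
```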
  • What is catastrophic forgetting in the context of SFT, and what is the primary mitigation?
    Fine-tuning on a narrow task erodes general pretrained capabilities; mitigated by mixing general instruction data into the SFT set
    Catastrophic forgetting happens when gradient updates for the new task overwrite weights that encode general skills. Including diverse general instruction examples in the training mix preserves those capabilities.
  • Why does dataset quality matter more than dataset quantity for SFT?
    A handful of high-quality demonstrations teaches the correct output distribution; low-quality examples train the model to imitate bad patterns at scale
    SFT is imitation learning. A model trained on noisy or low-quality responses learns to produce noisy, low-quality outputs. A small set of excellent demonstrations reliably outperforms a large set of mediocre ones.
  • A model trained with SFT using chat template A is deployed behind a front-end that applies chat template B. What failure mode should you expect?
    The model will misidentify turn boundaries, likely merging roles or ignoring the system prompt
    The chat template is baked into the token sequence the model was trained on. Different special tokens and delimiters mean the model cannot reliably parse which tokens are system, user, and assistant turns.

Ch 17: RLHF, DPO, KTO, Constitutional AI

  • In the classical RLHF pipeline, why is a KL penalty applied between the policy and the frozen SFT reference model?
    Without KL regularization, the policy collapses to degenerate outputs that score highly on the reward model but are not useful
    Reward hacking: unconstrained PPO discovers outputs the reward model scores highly but that are nonsensical or repetitive. The KL penalty keeps the policy close to the SFT reference, bounding how far it can deviate.
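A sketch of the commonly used shaped reward, r − β·(log πθ − log πref); beta and the log-prob values here are illustrative:

```python
def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    # reward-model score minus a KL-style penalty for drifting from the reference
    return rm_score - beta * (logp_policy - logp_ref)

# a high-scoring rollout that drifts far from the reference is penalized harder
assert shaped_reward(2.0, -5.0, -20.0) < shaped_reward(2.0, -19.0, -20.0)
```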
  • DPO eliminates the reward model. What does it optimize directly instead?
    The log-likelihood of the preferred response over the dispreferred response, weighted by the implicit reward derived from the policy ratio
    DPO derives a closed-form expression for the optimal reward in terms of the policy and the reference, then collapses the pipeline into a single loss on preference pairs without needing a separate reward model.
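The resulting loss is a logistic loss on the difference of the two implicit rewards; a minimal sketch with illustrative log-prob values:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # implicit rewards are beta-scaled log-ratios against the frozen reference
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

# loss drops as the policy prefers the chosen response more than the reference does
assert dpo_loss(-10.0, -12.0, -11.0, -11.0) < dpo_loss(-11.0, -11.0, -11.0, -11.0)
```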
  • Why does the classical RLHF pipeline with PPO require four model copies in memory simultaneously?
    Policy, reference policy, reward model, and value function model must all be live
    PPO-based RLHF needs the active policy (being updated), a frozen copy of the SFT reference (for KL), the reward model (for scoring rollouts), and a value model (critic for advantage estimation).
  • KTO uses only binary thumbs-up or thumbs-down labels rather than pairwise preferences. Under what data conditions does this make KTO preferable to DPO?
    When you have abundant binary feedback but cannot pair up a specific preferred response with a specific dispreferred response for the same prompt
    DPO requires matched (prompt, chosen, rejected) triples. Real-world feedback systems often collect per-response thumbs up or down without creating matched pairs. KTO is designed for this unpaired binary signal.

Ch 18: Distillation, pruning, and training-time compression

  • What is the 'dark knowledge' that soft targets from a teacher model provide to a student during distillation?
    The probability distribution over all classes which encodes the teacher's confidence and similarity structure among classes
    Soft targets expose inter-class similarity in the teacher's probability distribution (e.g., a 'cat' image scoring 0.01 for 'dog' is informative). Hard one-hot labels discard all of this structural information.
  • Why is temperature T > 1 used when computing soft targets from a teacher model during distillation?
    Higher temperature smooths the teacher's output distribution, making the non-peak probabilities larger and more informative for the student
    Dividing logits by T before the softmax flattens the distribution. At T=1 the distribution is often near one-hot; at T>1 smaller logits produce larger probabilities, giving the student richer gradient signal from the similarity structure.
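The smoothing effect in three lines (logit values are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([8.0, 2.0, 1.0])
hard = softmax(logits)        # ~[0.997, 0.002, 0.001] — nearly one-hot
soft = softmax(logits / 4.0)  # ~[0.72, 0.16, 0.12] — similarity structure visible
```

The ranking is preserved; only the relative mass on the non-peak classes grows.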
  • Unstructured pruning removes individual weights based on magnitude. Why has it not displaced quantization as the dominant inference compression technique?
    Unstructured sparsity requires specialized sparse kernels to realize speedup; dense quantized matrix multiplies map directly onto existing GPU hardware Tensor Cores
    Irregular sparse weight matrices cannot use standard dense matrix multiply hardware efficiently. Without hardware-native sparse support, the theoretical compute savings do not translate to wall-clock speedup, whereas INT8 and FP8 have native Tensor Core paths.
  • The Lottery Ticket Hypothesis claims a sparse subnetwork exists that can train to the same accuracy as the full network. Why has this not led to a practical training speedup?
    Finding the winning ticket requires first training the full network to identify which weights to keep
    Identifying the winning ticket requires full training (or iterative magnitude pruning) of the original network. You can only prune after training, eliminating any training-time compute savings.

Ch 19: Synthetic data and self-improvement loops

  • Self-Instruct generates SFT training data without human-written examples for each task. What is its core mechanism?
    A strong LLM is prompted with a small seed set of tasks to generate new task instructions and corresponding responses
    Self-Instruct uses a model to bootstrap diversity from a small seed set, generating new instructions and responses that are then filtered and used for fine-tuning. This breaks the dependence on large human annotation.
  • Rejection sampling generates many candidate responses per prompt and keeps only the best ones. What quality signal is typically used to select which responses to keep?
    A reward model score or a verifiable correctness check such as code execution or math answer verification
    Rejection sampling is useful when you can define 'best' cheaply. For math and code, exact correctness is verifiable. For open-ended tasks, a reward model provides the ranking signal.
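The loop itself is trivial; `generate` and `score` below are hypothetical stand-ins for a sampler and a verifier or reward model:

```python
def rejection_sample(prompt, generate, score, n=16):
    # draw n candidates, keep the best one under the quality signal
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# toy stand-ins: "generation" cycles canned answers, "score" checks correctness
answers = iter(["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"] * 6)
best = rejection_sample("What is 2 + 2?", lambda p: next(answers),
                        lambda s: 1.0 if s.endswith("= 4") else 0.0)
assert best == "2 + 2 = 4"
```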
  • Model collapse refers to the degradation in diversity and quality that occurs when models are trained on data generated by earlier model generations. What is the proposed root cause?
    Each generation's model amplifies its own distributional biases in the data it produces, causing cumulative drift away from the true data distribution
    Each model generation produces outputs that reflect its own biases and blind spots. When the next generation trains on those outputs, it learns the biases more strongly, and the cycle compounds across generations.
  • Constitutional AI uses a model's self-critique to produce preference data without human raters. Why does this still require a high-quality base model to function correctly?
    A model with poor judgment produces self-critiques that reflect its own flaws, reinforcing rather than correcting bad behaviors
    The self-critique loop is only as good as the model's ability to judge its own outputs. A weak or misaligned model produces approvals for its own flawed responses, making the synthetic preference data meaningless or harmful.

Ch 20: Evaluation: the hardest unsolved problem in ML

  • What is benchmark contamination and why does it make published benchmark scores upper bounds rather than true capability estimates?
    Benchmark test questions may appear in pretraining or fine-tuning data, causing inflated scores that do not reflect generalization to unseen problems
    If a model has seen benchmark questions during training, it can recall rather than reason. The resulting score reflects memorization, not capability, making comparisons across models unreliable.
  • An LLM-as-judge pipeline rates model A higher than model B. What well-documented bias should make you skeptical of this result?
    Judge models are biased toward the first response shown in pairwise comparisons and toward responses stylistically similar to their own training data
    LLM judges show position bias (preferring whichever response comes first), verbosity bias (preferring longer answers), and self-enhancement bias (preferring outputs similar to their own training distribution).
  • Why is the LMSYS Chatbot Arena considered more contamination-resistant than static benchmarks like MMLU?
    Arena questions are generated by real users in real time and are not publicly known in advance, making it impossible to train specifically on the test set
    Static benchmarks have fixed public questions that can leak into training data. Arena questions are unpublished user queries submitted live, so there is no fixed test set to contaminate.
  • A team evaluates their new fine-tuned model on internal task-specific golden sets and sees improvement, but also runs MMLU and sees regression. How should they interpret this?
    The fine-tuning improved the targeted task at the cost of some general knowledge; whether this is acceptable depends entirely on the deployment use case
    No single eval tells the full story. A focused fine-tune commonly trades some general capability for task performance. The right interpretation requires comparing the tradeoff against the actual user requirement, not declaring one eval the winner.