Night-before cheat sheet
Part I: ML Foundations
10 chapters (1–10). 40 key facts to review.
Key facts (from quizzes)
Ch 1: The mathematical objects: tensors, shapes, broadcasting
- What is the rank of a tensor with shape (3, 4, 5)? → 3. Rank is the number of dimensions (the length of the shape tuple); the values in the shape tuple are the sizes along each dimension.
- When broadcasting two tensors of shapes (5, 1) and (1, 4), what is the output shape? → (5, 4). NumPy/PyTorch broadcast rules expand size-1 dimensions to match the other operand: dimension 0 expands to 5 and dimension 1 expands to 4, giving (5, 4).
- A tensor of shape (4, 6) is stored with strides (6, 1). After calling .T (transpose), what are the new strides? → (1, 6). Transpose swaps axes, which swaps the strides: the original (6, 1) becomes (1, 6). No data is copied, only the metadata changes, so the tensor is no longer contiguous.
- Which dtype change increases memory footprint while providing the widest dynamic range? → float32 to float64. float64 uses 8 bytes per element vs 4 for float32, doubling memory while extending dynamic range to roughly ±1.8×10^308.
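The broadcasting fact above can be checked with a small shape-only function. This is a minimal sketch of the size-1 expansion rule, not NumPy's actual implementation:

```python
def broadcast_shape(a, b):
    # Left-pad the shorter shape with 1s, then apply the rule:
    # dimensions must match, or one of them must be 1 (and expands).
    n = max(len(a), len(b))
    a = (1,) * (n - len(a)) + tuple(a)
    b = (1,) * (n - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x == y or x == 1 or y == 1:
            out.append(max(x, y))
        else:
            raise ValueError(f"incompatible sizes {x} and {y}")
    return tuple(out)

print(broadcast_shape((5, 1), (1, 4)))  # (5, 4)
print(len((3, 4, 5)))                   # rank = length of shape tuple = 3
```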
Ch 2: The forward pass: a neural network as a pure function
- For a linear layer with in_features=512 and out_features=256, how many trainable parameters does it have (including bias)? → 512 × 256 + 256 = 131,328. The weight matrix W has shape (256, 512), giving 131,072 parameters, plus one bias term per output feature (256).
- Why is a nonlinear activation function necessary between linear layers? → Without it, stacking linear layers collapses to a single linear transformation with no additional expressive power. The composition of linear functions is itself linear: W2(W1x) = (W2W1)x, so a 100-layer network without nonlinearities is mathematically identical to a single linear layer.
- GELU is preferred over ReLU in modern transformers primarily because it is → smooth and differentiable everywhere, avoiding the dying-neuron problem. ReLU is exactly zero for all negative inputs, which can permanently kill neurons; GELU is smooth near zero and has non-zero gradient for most negative inputs, making optimization more robust.
- Xavier/Glorot initialization sets initial weights to have variance proportional to 1/(fan_in + fan_out). What failure mode does this prevent? → Signal variance exploding or vanishing as it propagates through many layers. If weights are too large, the signal variance grows exponentially with depth; too small, and it shrinks to zero. Xavier init targets a variance that keeps activations roughly unit-variance through the whole forward pass.
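The parameter count and the Xavier variance formula above are both one-liners worth memorizing as code. A minimal sketch (the function names are mine, not a library API):

```python
import math

def linear_params(in_features, out_features, bias=True):
    # Weight matrix of shape (out_features, in_features),
    # plus one bias term per output feature.
    return in_features * out_features + (out_features if bias else 0)

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot: weight variance 2 / (fan_in + fan_out),
    # chosen to keep activation variance roughly constant with depth.
    return math.sqrt(2.0 / (fan_in + fan_out))

print(linear_params(512, 256))  # 131328
print(xavier_std(512, 256))     # ~0.051
```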
Ch 3: The backward pass: autograd, gradients, the chain rule made mechanical
- Why does activation memory grow with batch size during the backward pass? → Intermediate activations from the forward pass must be stored for each sample in the batch to compute gradients. Each sample produces its own set of intermediate activations that must be held in memory until the backward pass uses them, so peak activation memory is O(batch_size × depth).
- What is the key reason reverse-mode autodiff (backprop) dominates ML rather than forward-mode autodiff? → Reverse-mode computes gradients with respect to all parameters in a single backward pass regardless of parameter count. Forward-mode needs one pass per input dimension; reverse-mode needs one pass per output dimension. For ML (scalar loss, millions of parameters), one backward pass gives all gradients simultaneously.
- What does tensor.detach() do that torch.no_grad() does not? → detach() creates a new tensor with no grad_fn, cutting that specific tensor out of the computation graph so gradients will not flow through it during backward. no_grad() is a context manager that prevents new operations from being recorded, but it doesn't sever an existing graph.
- Gradient checkpointing trades compute for memory by → discarding intermediate activations and recomputing them from the nearest checkpoint during the backward pass. Checkpointing stores only the inputs to designated segments (not their intermediate activations); during backward, those segments are re-run in forward mode to regenerate the activations needed for gradient computation.
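The first two facts can be seen in a toy two-parameter example (a hypothetical function, not a real autograd engine): the forward pass stores an intermediate value, and a single backward sweep then yields the gradient for every parameter at once via the chain rule.

```python
# f(w1, w2) = (w1 * x + w2) ** 2, differentiated by hand the way
# reverse-mode autodiff would do it mechanically.
def f_and_grads(w1, w2, x):
    a = w1 * x + w2        # forward: the stored "activation"
    loss = a * a
    d_a = 2 * a            # backward: dloss/da, reusing the stored a
    return loss, (d_a * x, d_a)   # (dloss/dw1, dloss/dw2) in one pass

loss, grads = f_and_grads(3.0, 1.0, 2.0)
print(loss, grads)  # 49.0 (28.0, 14.0)
```

Note that `a` had to be kept around until the backward step; with a batch of inputs there is one such `a` per sample, which is exactly why activation memory scales with batch size, and why checkpointing chooses to recompute `a` instead of storing it.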
Ch 4: Loss functions and optimization, in just enough depth
- AdamW differs from Adam by decoupling weight decay. Concretely, what does this mean? → AdamW applies weight decay directly to the parameters rather than adding an L2 penalty to the loss. In Adam, L2 regularization interacts with the adaptive learning rates and is effectively scaled down for parameters with large gradients; AdamW bypasses the optimizer state and subtracts a fixed fraction of the weight directly, matching the intended regularization effect.
- For a model trained with AdamW in full fp32, how many bytes of optimizer state are stored per parameter? → 8 bytes (first and second moment in fp32). Adam stores two moving averages per parameter, the first moment (mean gradient) and the second moment (mean squared gradient), each in fp32 (4 bytes). That's 8 bytes of optimizer state per parameter, on top of the parameter itself.
- Cross-entropy loss for a softmax classifier is often called 'negative log-likelihood.' Why negative? → Because probability values are at most 1, so their log is non-positive, and we negate to get a positive loss to minimize. Negating turns 'maximize log-likelihood' into a standard minimization problem, giving a loss that is zero when the predicted probability is 1 and grows toward infinity as it approaches 0.
- A learning rate warmup schedule holds lr near zero for the first N steps before rising to the target. The main reason for warmup is → to avoid large updates when the optimizer's moment estimates are unreliable early in training. At step 0, Adam's moment estimates are initialized to zero and haven't tracked any gradients yet; taking a full-size step with uninitialized moments can send parameters to extreme values. Warmup gives the moment estimates time to stabilize before full-scale updates happen.
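The AdamW facts above fit in one scalar update. This is a minimal single-parameter sketch (default hyperparameters chosen for illustration), showing both the two fp32 moments and the decoupled decay step:

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    # One AdamW update for a single scalar parameter p with gradient g.
    m = b1 * m + (1 - b1) * g       # first moment  (4 bytes/param in fp32)
    v = b2 * v + (1 - b2) * g * g   # second moment (another 4 bytes/param)
    m_hat = m / (1 - b1 ** t)       # bias correction for zero-initialized moments
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled weight decay: applied to the weight directly,
    # bypassing the adaptive moments (this is the AdamW difference).
    p = p - lr * wd * p
    return p, m, v

p, m, v = adamw_step(1.0, 0.5, 0.0, 0.0, t=1)
print(p, m, v)  # p moves against the gradient and is additionally decayed
```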
Ch 5: Tokens, vocabularies, and the tokenizer is the bug
- Byte-Pair Encoding (BPE) constructs its vocabulary by repeatedly → merging the most frequently co-occurring adjacent pair of symbols into a new symbol. BPE starts from single bytes or characters and greedily merges the most frequent adjacent pair into a new compound symbol, repeating until the target vocabulary size is reached and building up common subwords organically.
- A user asks why GPT-4 cannot correctly count the letters in the word 'strawberry.' The real explanation is → the word is tokenized into subword units that don't correspond to individual letters, so the model never sees raw characters. 'strawberry' is typically split into tokens like 'straw' + 'berry'; the individual character 'r' never appears as a distinct token, and counting characters requires the model to reason about subword internals it was never trained to track.
- A model trained with tokenizer version A is deployed with tokenizer version B, where a common token was split differently. What is the most likely production symptom? → Silent quality degradation: inputs are tokenized differently than during training, but the model accepts them without error. If both tokenizers share the same vocabulary size, the embedding layer won't error; token IDs for the same text will differ, so the model sees different integer sequences than it was trained on, degrading quality silently.
- Why do languages with complex morphology (e.g., Turkish, Finnish) cost more to process with BPE tokenizers trained primarily on English? → Morphologically rich words rarely appear as complete units, so BPE falls back to many short subword pieces per word, inflating sequence length and cost. An English-biased BPE tokenizer has few or no merged tokens for rare morphological forms; each such word is split into many pieces, making sequences longer, and attention cost scales quadratically with sequence length.
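The greedy merge loop can be sketched in a few lines. This is a toy version operating on a single string (real BPE trainers work over word frequency tables and byte-level alphabets):

```python
from collections import Counter

def bpe_merges(text, num_merges):
    # Start from single characters; repeatedly merge the most frequent
    # adjacent pair into one new compound symbol.
    symbols = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols

merges, toks = bpe_merges("abababc", 2)
print(merges, toks)  # ['ab', 'abab'] ['abab', 'ab', 'c']
```

The second merge builds on the first ('abab' is a merge of two 'ab' symbols), which is how common subwords accrete from frequent fragments.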
Ch 6: Attention from first principles
- Why do we divide the attention scores by √d_k before the softmax? → To keep the pre-softmax variance constant across model widths and prevent softmax saturation. Random dot products of d_k-dimensional unit-variance vectors have variance d_k; dividing by √d_k normalizes that to variance 1, keeping logits in a range where softmax gradients don't vanish.
- What is the memory complexity of a naive attention implementation in the sequence length s? → O(s²). The score matrix has one entry per (query, key) pair, s² entries in total; materializing it is the reason FlashAttention exists.
- In multi-head attention with model dimension d_model and H heads, what is the per-head dimension d_h? → d_model / H. The H heads partition the model dimension so total compute is roughly the same as single-head attention with dimension d_model; each head sees a d_model/H-sized slice.
- Why does the causal mask only matter for decoder-style (autoregressive) attention, not encoder attention? → Encoders see the whole sequence at once because they're not predicting the next token, so every position is allowed to attend to every other position. The causal mask enforces that position i cannot use information from positions > i, a constraint needed for autoregressive generation; encoders (e.g., BERT) are doing representation learning over the full input.
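The causal mask plus softmax can be sketched directly on an s × s score matrix (assumed already scaled by 1/√d_k). Note that even this toy version materializes all s² entries, which is the memory complexity fact above:

```python
import math

def causal_attention_weights(scores):
    # scores[i][j] = q_i . k_j / sqrt(d_k); mask out j > i, then
    # apply a numerically stable softmax over each row.
    s = len(scores)
    weights = []
    for i in range(s):
        row = [scores[i][j] for j in range(i + 1)]  # causal: only keys <= i
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        weights.append([e / z for e in exps] + [0.0] * (s - i - 1))
    return weights

w = causal_attention_weights([[0.0, 9.0, 9.0],
                              [1.0, 1.0, 9.0],
                              [0.0, 0.0, 0.0]])
# Row 0 can only attend to itself, no matter how large its masked scores are;
# row 2 spreads attention over all three keys.
print(w)
```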
Ch 7: The transformer end to end
- In a pre-norm transformer block, LayerNorm is applied before the attention and FFN sub-layers. What training problem does this solve compared to post-norm? → It stabilizes gradient magnitudes at initialization, allowing training without careful learning rate warmup. Post-norm places LayerNorm after the residual addition, which at initialization can produce very large or very small gradient norms deep in the network; pre-norm normalizes inputs before each sub-layer, keeping gradient scale predictable.
- RoPE (Rotary Position Embedding) encodes position by → rotating the query and key vectors by an angle proportional to their absolute position before the attention dot product. The rotation is chosen so that the dot product QK^T depends only on the relative distance between positions, giving the model a built-in sense of relative position and better generalization to longer contexts than absolute embeddings.
- The SwiGLU FFN used in Llama replaces the standard two-linear-layer FFN. Its key structural difference is → it splits the projection into two branches and multiplies them elementwise, with one branch passing through Swish and gating the other. SwiGLU is a gated linear unit: FFN(x) = (Swish(W1 x) ⊙ W2 x) W3. The Swish-activated branch acts as a soft gate on the other branch, improving expressiveness; to keep FLOPs comparable to the original FFN, the intermediate dimension is typically reduced from 4d to roughly 2.67d.
- Why did the decoder-only architecture become dominant over encoder-decoder for large language models? → Decoder-only models unify pretraining and fine-tuning under a single next-token prediction objective, simplifying scaling. A decoder-only model pretrained on next-token prediction can be fine-tuned or prompted for any task without architectural changes; encoder-decoder models require task-specific design choices about what goes in the encoder vs decoder, adding complexity that doesn't pay off at large scale.
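The RoPE relative-position property can be verified numerically: rotate each (even, odd) pair of a vector by an angle that grows with absolute position, and the dot product of two rotated vectors depends only on their position difference. A minimal sketch with made-up example vectors:

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    # Rotate consecutive (even, odd) coordinate pairs by pos * base^(-i/d),
    # i.e. a different rotation frequency per pair.
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [1.0, 0.0, 0.5, 0.5], [0.2, 0.8, 1.0, 0.0]
# Same relative offset (query 2 positions after key) at different
# absolute positions gives the same attention score.
s1 = dot(rope_rotate(q, 5), rope_rotate(k, 3))
s2 = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(abs(s1 - s2) < 1e-9)  # True
```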
Ch 8: The decoding loop: autoregressive generation, sampling, controllability
- Prefill and decode have radically different performance profiles. The key reason is → prefill processes all prompt tokens in parallel (compute-bound), while decode generates one token at a time against a growing KV cache (memory-bound). During prefill the GPU computes attention over the full prompt in a single forward pass, which is arithmetic-intensive; during decode each step produces only one new token, and the bottleneck is reading the KV cache from HBM, making it memory-bandwidth-bound.
- Top-p (nucleus) sampling selects from the smallest set of tokens whose cumulative probability exceeds p. Compared to top-k, the main advantage is → top-p adapts the number of candidates to the actual distribution sharpness, drawing from more tokens when the model is uncertain and fewer when confident. Top-k always considers exactly k options regardless of how peaked or flat the distribution is; top-p is dynamic: when the model is very confident one token dominates and the nucleus is small, and when uncertain many tokens share probability mass and the nucleus is larger.
- Setting temperature to 0 should produce the same greedy output every time. In practice this often fails because → floating-point non-determinism in GPU operations can produce different top-1 tokens across runs when two tokens have nearly equal logits. GPU operations like matrix multiplications may execute in different orders depending on hardware state and parallelism, producing slightly different floating-point results; when two tokens have nearly identical logits, these tiny differences can flip which token has the maximum.
- Beam search was the dominant decoding algorithm for seq2seq models but is rarely used for large LLMs. The main reason is → beam search produces high-likelihood but often generic, repetitive text that users prefer less than samples from good temperature/top-p settings. Beam search maximizes the probability of the output sequence, which tends to produce bland, repetitive text; for open-ended generation, users strongly prefer diverse samples from well-tuned temperature + nucleus sampling. Memory cost is a secondary concern.
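The adaptive-nucleus behavior is easy to demonstrate. A minimal sketch of the candidate-selection step only (real samplers then renormalize over the nucleus and draw from it):

```python
def nucleus(probs, p):
    # Take tokens in descending probability until cumulative mass >= p:
    # the smallest set covering probability p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += probs[i]
        if total >= p:
            break
    return chosen

# Confident distribution: the nucleus collapses to one token.
print(nucleus([0.90, 0.05, 0.03, 0.02], p=0.9))  # [0]
# Flat distribution: the nucleus expands to cover the mass.
print(nucleus([0.30, 0.25, 0.25, 0.20], p=0.9))  # [0, 1, 2, 3]
```

Top-k with k fixed at, say, 2 would return two candidates in both cases, which is exactly the rigidity the fact above describes.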
Ch 9: Embeddings and rerankers: what they are, why they're separate, why they're cheap
- A bi-encoder (embedding model) and a cross-encoder (reranker) both score a query against a document. The fundamental architectural difference is → bi-encoders encode query and document independently and compare embeddings; cross-encoders concatenate query and document and score them jointly. A bi-encoder embeds each piece of text separately, enabling pre-computation and ANN search; a cross-encoder sees the concatenated (query, document) pair, allowing full cross-attention between them. That's higher quality but impossible to pre-compute, so it's only practical for re-scoring a short candidate list.
- Contrastive training of an embedding model with in-batch negatives works by → treating all other examples in the same batch as negatives for each anchor-positive pair and minimizing a softmax cross-entropy over the batch. For a batch of N (query, positive) pairs, in-batch negatives turns the problem into N-way classification: the model must assign higher similarity to the true positive than to all N-1 other positives in the batch, which act as negatives. This is efficient because no extra negative mining is needed.
- Why is L2-normalizing embedding vectors before computing similarity almost always the right choice for retrieval? → It converts cosine similarity to an equivalent dot product, enabling faster FAISS or vector-DB operations while controlling for vector magnitude. After L2 normalization, cosine_similarity(a, b) = dot(a, b), so you can use fast dot-product ANN indexes (like FAISS inner product). It also ensures similarity reflects direction (semantic meaning) rather than magnitude (length of the text), which is almost always what you want.
- ColBERT (late interaction) differs from a standard bi-encoder by → storing a separate embedding per document token and computing a max-sim score at query time, rather than pooling a single vector per document. At query time each query-token embedding picks its maximum similarity across all document-token embeddings, and these max-sims are summed. This 'late interaction' gives near-cross-encoder quality at near-bi-encoder speed.
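The normalization identity above is worth confirming once by hand. A minimal sketch with small example vectors (2-D here; real embeddings are hundreds of dimensions, but the identity is dimension-independent):

```python
import math

def l2_normalize(v):
    # Divide by the Euclidean norm so the vector has length 1.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
# After L2 normalization, a plain dot product IS the cosine similarity,
# so a fast inner-product index returns cosine rankings.
print(abs(cosine(a, b) - dot(l2_normalize(a), l2_normalize(b))) < 1e-12)  # True
```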
Ch 10: The model card: reading model lineage adversarially
- A model card shows strong benchmark scores on MMLU, HumanEval, and MT-Bench. The most important adversarial question to ask about these numbers is → were the evaluations run by the model authors on data the model may have been trained on? Model authors have every incentive to run evals on their own infrastructure and may not prevent training-set contamination; independent reproduction on held-out data almost always shows lower numbers. This is the single most important thing to scrutinize.
- A model is described as derived from a base with the lineage 'base → SFT → DPO → GGUF Q4_K_M.' What does Q4_K_M tell you about serving cost vs the full-precision model? → The weights have been quantized to approximately 4 bits per parameter, reducing memory by roughly 8× vs fp32. GGUF Q4_K_M is a 4-bit quantization format used by llama.cpp; at ~4 bits per weight (vs 32 for fp32 or 16 for bf16), the model is 8× smaller than fp32 and 4× smaller than bf16, making it runnable on consumer hardware at the cost of some quality.
- A model card advertises a 128k context window. The adversarial interpretation is → the model was likely trained on much shorter sequences, so the long context works mechanically but quality degrades. Many models advertise large context windows through simple RoPE scaling without training on long sequences; the model may produce syntactically valid output at 128k tokens, but retrieval and reasoning quality often degrades severely beyond the actual training context length.
- The 'base model' under a chat fine-tune is important to identify because → chat fine-tunes inherit the base model's license, which may restrict commercial use. Fine-tuning does not change the license of the underlying base model; a Llama-2 fine-tune inherits Llama-2's community license terms, and using a 'permissively licensed' fine-tune that sits on a restrictive base in a commercial product is a legal risk that model cards often obscure.
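The quantization memory claim is pure arithmetic and worth sanity-checking. A minimal sketch for a hypothetical 7B-parameter model, counting weight memory only (it ignores KV cache, activations, and the small per-block overhead of scales and zero-points in real Q4_K_M files):

```python
def model_bytes(num_params, bits_per_param):
    # Weight memory in bytes: parameters times bits, divided by 8.
    return num_params * bits_per_param / 8

params = 7e9                    # hypothetical 7B-parameter model
fp32 = model_bytes(params, 32)
q4 = model_bytes(params, 4)
print(fp32 / q4)                # 8.0  -> the ~8x reduction vs fp32
print(round(q4 / 2**30, 2))     # ~3.26 GiB of weights at 4 bits/param
```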