Math review for the rusty
The math used in the rest of this book is a small subset of undergraduate linear algebra, probability, and information theory. This appendix is a targeted refresher: enough to follow the derivations in Chapters 6 (attention), 15 (LoRA), 17 (DPO), 25 (FlashAttention), 26 (quantization), and 33 (MLA) without feeling lost.
Not a textbook. If you want a real textbook, read Bishop’s Pattern Recognition and Machine Learning or Deisenroth, Faisal, Ong, Mathematics for Machine Learning. This is the 30-minute version that gets you unstuck.
C.1 Linear algebra
C.1.1 Vectors, matrices, and the shapes that matter
A vector is an ordered list of numbers. In ML we write vectors as columns by default, which means a vector x of length d has shape (d, 1) as a matrix but is often written just (d,). A row vector is its transpose, shape (1, d).
A matrix is a rectangular grid of numbers with shape (m, n): m rows, n columns. The (i, j) entry is M[i, j], the number in the i-th row and j-th column.
The operations you need to know:
- Addition and scalar multiplication. Elementwise. Same shapes required.
- Transpose Mᵀ. Flips rows and columns: shape (m, n) becomes (n, m), with Mᵀ[i, j] = M[j, i].
- Matrix multiplication AB. Only defined when A has shape (m, k) and B has shape (k, n). The result has shape (m, n), with (AB)[i, j] = Σₖ A[i, k] · B[k, j]. The inner dimensions must match; the outer dimensions become the output shape. Matrix multiplication is not commutative: in general AB ≠ BA, and often only one of them is even shape-valid.
Small worked example. Let
A = [[1, 2],    B = [[5, 6],
     [3, 4]]         [7, 8]]
Then AB[0, 0] = 1·5 + 2·7 = 19, AB[0, 1] = 1·6 + 2·8 = 22, AB[1, 0] = 3·5 + 4·7 = 43, AB[1, 1] = 3·6 + 4·8 = 50. So
AB = [[19, 22],
[43, 50]]
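The same example is a one-liner to check in NumPy:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

AB = A @ B  # inner dims match: (2, 2) @ (2, 2) -> (2, 2)
assert (AB == np.array([[19, 22], [43, 50]])).all()

# Not commutative: BA gives a different matrix.
BA = B @ A
assert not (AB == BA).all()
```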
A dot product is the special case where both operands are vectors: aᵀb = Σᵢ aᵢ bᵢ. A scalar. Geometrically, aᵀb = ||a|| · ||b|| · cos θ, the product of magnitudes times the cosine of the angle between them. This is why cosine similarity is just a dot product of unit-length vectors.
An outer product is the other way around: abᵀ where a has shape (m, 1) and b has shape (n, 1) yields a matrix of shape (m, n) with rank 1.
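Both special cases in NumPy, with made-up vectors chosen so the cosine comes out clean:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

dot = a @ b  # 3·4 + 4·3 = 24, a scalar
# Cosine similarity: dot product of unit-length vectors.
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(cos, 24 / 25)

# Outer product: (2, 1) times (1, 2) gives a (2, 2) matrix of rank 1.
outer = np.outer(a, b)
assert np.linalg.matrix_rank(outer) == 1
```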
C.1.2 Why attention is a matmul
The attention formula from Chapter 6:
Attention(Q, K, V) = softmax(QKᵀ / √d) V
Read it as matmuls:
- Q has shape (S, d). One row per query token, one column per feature.
- K has shape (S, d). One row per key token. Kᵀ has shape (d, S).
- QKᵀ has shape (S, S). Entry (i, j) is the dot product of query i and key j — how much query i cares about key j, before normalization.
- Dividing by √d keeps the variance of the pre-softmax values from scaling linearly with d (intuition: dot products of random unit-variance vectors in d dimensions have variance that grows with d).
- softmax(...) is applied row-wise, producing an (S, S) matrix where each row sums to 1.
- Multiplying by V (shape (S, d)) yields the output of shape (S, d).
Two matmuls, one softmax. The whole transformer is a stack of matmuls plus some elementwise operations (activation, layer norm, residual add). This is why the entire discipline of LLM optimization is a discipline of matmul optimization.
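The whole formula is a few lines of NumPy. This is a sketch of single-head, unmasked attention on random inputs, not a production implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stability shift (see C.4.1)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (S, S): scaled query-key dot products
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (S, d)

rng = np.random.default_rng(0)
S, d = 4, 8
Q, K, V = rng.normal(size=(3, S, d))
out = attention(Q, K, V)
assert out.shape == (S, d)
```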
C.1.3 Identity, inverse, and singular matrices
The identity matrix I is the square matrix with 1s on the diagonal and 0s off-diagonal. IA = A and AI = A for any matrix with compatible shape.
A square matrix A has an inverse A⁻¹ if AA⁻¹ = A⁻¹A = I. Non-square matrices do not have inverses in the strict sense; they have pseudo-inverses (the Moore-Penrose pseudo-inverse A⁺).
A square matrix is singular (non-invertible) if its determinant is zero, which happens iff its rows (or columns) are linearly dependent. Linear dependence means you can write one row as a combination of the others — the matrix has redundant information. Rank is the formal measure of this.
You rarely compute actual matrix inverses in ML. Solving Ax = b is done via factorizations (LU, QR, Cholesky) rather than computing A⁻¹ explicitly because inverses are numerically unstable.
C.1.4 Rank
The rank of a matrix is the number of linearly independent rows (equivalently, columns). A matrix with shape (m, n) has rank at most min(m, n). If the rank equals the smaller dimension, the matrix is full rank. Otherwise, it’s rank-deficient, and can be written as the product of smaller matrices.
Concretely: a rank-r matrix of shape (m, n) can be written as
M = UVᵀ
where U has shape (m, r) and V has shape (n, r). The total number of parameters drops from m · n (dense) to (m + n) · r (factored). When r ≪ min(m, n), this saves a lot of memory.
This is the core math of low-rank approximation.
C.1.5 Low-rank approximation and why LoRA works
The LoRA paper (Hu et al. 2021) claims that fine-tuning a pretrained LLM rarely changes the weight matrices much — the delta ΔW is approximately low-rank. So instead of learning the full ΔW (with m · n parameters), you parameterize it as
ΔW = BA
where A has shape (r, n) and B has shape (m, r), for some small r (typically 8, 16, 32, 64). You learn A and B and freeze the base W. At inference, the adapted weight is W + BA.
The parameter count drops from m · n to (m + n) · r. For m = n = 4096, r = 8, that’s from 16.7M to 65K, a 256× reduction.
Why it works: empirically, the singular value spectrum of ΔW is dominated by a few large singular values, so approximating with a low-rank matrix captures most of the delta. The Eckart-Young theorem tells us that the best rank-r approximation (in Frobenius norm) of any matrix is obtained by truncating its SVD to the top r singular values. LoRA is an approximate, trained version of this truncation.
C.1.6 Eigendecomposition and SVD
Eigenvectors of a square matrix A are the nonzero vectors v such that Av = λv for some scalar λ; the scalars are the eigenvalues. An n × n matrix with n linearly independent eigenvectors can be written
A = VΛV⁻¹
where V has eigenvectors as columns and Λ is diagonal with eigenvalues on the diagonal. This only works for diagonalizable matrices, which is not all of them, and the concept only really makes sense for square matrices.
The singular value decomposition (SVD) is more general. For any matrix M of shape (m, n):
M = UΣVᵀ
where U has shape (m, m) and orthonormal columns, Σ has shape (m, n) with non-negative singular values on the diagonal (and zeros elsewhere), and V has shape (n, n) with orthonormal columns. The singular values are conventionally sorted largest to smallest. SVD always exists. No square requirement.
The rank of M equals the number of nonzero singular values. The squared Frobenius norm (the sum of squared entries) equals the sum of squared singular values.
Low-rank approximation via SVD: keep only the top r singular values and corresponding columns of U and V. That’s the best rank-r approximation of M in Frobenius norm. The approximation error is the sum of the remaining squared singular values. If the spectrum decays quickly, a low-rank approximation is nearly perfect.
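You can watch Eckart-Young hold numerically. A sketch with a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(20, 10))

U, s, Vt = np.linalg.svd(M, full_matrices=False)  # s sorted largest to smallest
r = 4
M_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]  # best rank-r approximation

# Squared Frobenius error equals the sum of discarded squared singular values.
err = np.linalg.norm(M - M_r, "fro") ** 2
assert np.isclose(err, np.sum(s[r:] ** 2))
assert np.linalg.matrix_rank(M_r) == r
```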
C.1.7 Norms
A few you’ll see:
- L2 norm of a vector: ||x||₂ = √(Σ xᵢ²). The Euclidean length.
- L1 norm: ||x||₁ = Σ |xᵢ|. Used in sparse regularization.
- Infinity norm: ||x||∞ = maxᵢ |xᵢ|. The largest component.
- Frobenius norm of a matrix: ||M||_F = √(Σᵢⱼ Mᵢⱼ²). The elementwise L2.
- Spectral norm of a matrix: the largest singular value. The “operator” norm, max ||Mx|| / ||x||.
In training, you’ll see gradient clipping by the L2 norm of the global gradient vector. In quantization, you’ll see the max norm (infinity norm) used to set scale factors.
C.1.8 Orthogonality
Two vectors are orthogonal if their dot product is zero. A set of vectors is orthonormal if each is unit length and they’re pairwise orthogonal. The columns of the attention projection matrices are not orthonormal in practice, but many useful constructions (random rotations, discrete Fourier transforms, QR decompositions) produce orthonormal matrices, and an orthonormal matrix Q satisfies QᵀQ = I, making Q⁻¹ = Qᵀ (cheap inverse).
RoPE (rotary position embeddings) uses orthogonal rotations — the positional transform is multiplication by a block-diagonal rotation matrix — because orthogonal transforms preserve dot products. That’s why RoPE can be applied to Q and K and leave the attention scores’ magnitudes intact while still encoding position information.
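The dot-product-preserving property is easy to see in two dimensions. A minimal sketch with a single rotation block:

```python
import numpy as np

theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # 2x2 rotation block

# Orthonormal: RᵀR = I, so R⁻¹ = Rᵀ (the cheap inverse).
assert np.allclose(R.T @ R, np.eye(2))

q = np.array([1.0, 2.0])
k = np.array([-0.5, 0.7])
# Rotating both vectors leaves their dot product unchanged:
# (Rq)·(Rk) = qᵀRᵀRk = qᵀk.
assert np.isclose(q @ k, (R @ q) @ (R @ k))
```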
C.2 Probability and statistics
C.2.1 Random variables, distributions, expectation, variance
A random variable is a variable whose value is the outcome of a random process. Capital letters: X, Y. The distribution of X specifies the probability of each possible value. For discrete random variables, a probability mass function P(X = x). For continuous, a probability density function p(x) where P(a ≤ X ≤ b) = ∫ₐᵇ p(x) dx.
Expectation (mean):
E[X] = Σ x · P(X = x) (discrete)
E[X] = ∫ x · p(x) dx (continuous)
Variance:
Var[X] = E[(X - E[X])²] = E[X²] - E[X]²
Standard deviation is σ = √Var[X].
Properties you need:
- Expectation is linear: E[aX + bY] = a E[X] + b E[Y], regardless of independence.
- Variance is not linear in general. But for independent X and Y: Var[X + Y] = Var[X] + Var[Y].
- For independent variables: E[XY] = E[X] E[Y].
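The independence properties are easy to check by simulation. A quick sketch with two independent variables of known variance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 2, size=100_000)    # Var[X] = 4
Y = rng.uniform(-1, 1, size=100_000)  # Var[Y] = 1/3, independent of X

# For independent X, Y: Var[X + Y] = Var[X] + Var[Y] ...
assert abs(np.var(X + Y) - (np.var(X) + np.var(Y))) < 0.05
# ... and E[XY] = E[X] E[Y].
assert abs(np.mean(X * Y) - X.mean() * Y.mean()) < 0.05
```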
C.2.2 The distributions you’ll see
Bernoulli(p): one of two outcomes with probability p and 1-p. Mean p, variance p(1-p). Every binary decision in ML is a Bernoulli.
Categorical(π): one of k outcomes with probability vector π. The output of a softmax layer is a parameter for a categorical distribution over tokens.
Gaussian / Normal N(μ, σ²): the bell curve. Density p(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²)). The default continuous distribution. Appears in activation initialization, noise models, and the “random weights” assumption behind the √d in attention.
Uniform(a, b): equal probability on [a, b]. The default “no prior information” distribution.
C.2.3 Conditional probability and Bayes’ rule
Conditional probability: P(A | B) = P(A, B) / P(B). Read “the probability of A given B.”
From the symmetric form P(A, B) = P(A | B) P(B) = P(B | A) P(A), rearrange:
P(A | B) = P(B | A) · P(A) / P(B)
This is Bayes’ rule. P(A) is the prior, P(B | A) is the likelihood, P(B) is the evidence, P(A | B) is the posterior. Most of probabilistic ML is Bayes in disguise.
In an LLM, P(next_token | context) is what the model computes. The training objective maximizes this over a corpus, which is maximum likelihood estimation.
C.2.4 Maximum likelihood and cross-entropy
Given data {x₁, ..., xₙ} and a model with parameters θ that assigns probability P(x | θ), maximum likelihood estimation picks the θ that maximizes the joint probability of the data:
θ* = argmax_θ Π P(xᵢ | θ) = argmax_θ Σ log P(xᵢ | θ)
We take the log because (a) sums are easier to compute than products, (b) log-probabilities don’t underflow, and (c) the log is monotonic so the argmax doesn’t change.
For a language model, P(xᵢ | θ) is the probability the model assigns to token xᵢ given its context. The training loss is the negative log-likelihood averaged over the corpus:
L = -(1/N) Σ log P(xᵢ | context_i, θ)
This is exactly the cross-entropy loss. “Cross-entropy” and “negative log-likelihood” are the same thing for a categorical distribution.
C.2.5 Sampling
Drawing a random value from a distribution. For continuous distributions, the inverse CDF method: draw u ~ Uniform(0, 1) and return F⁻¹(u). For categoricals (language models), the discrete analogue: draw a uniform random number and walk the cumulative distribution until it exceeds it. The temperature, top_k, top_p parameters of Chapter 8 modify the distribution before sampling.
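A sketch of the discrete inverse-CDF method, checked against the empirical frequencies:

```python
import numpy as np

def sample_categorical(probs, rng):
    """Inverse-CDF sampling: find where a uniform draw lands in the CDF."""
    u = rng.random()  # uniform in [0, 1)
    cdf = np.cumsum(probs)
    return int(np.searchsorted(cdf, u, side="right"))

rng = np.random.default_rng(0)
probs = np.array([0.7, 0.2, 0.1])
draws = [sample_categorical(probs, rng) for _ in range(10_000)]

# Empirical frequencies should be close to the true probabilities.
freq = np.bincount(draws, minlength=3) / len(draws)
assert np.allclose(freq, probs, atol=0.02)
```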
C.2.6 Covariance and correlation
Covariance of X and Y: Cov[X, Y] = E[(X - E[X])(Y - E[Y])]. Zero iff X and Y are uncorrelated (a weaker condition than independent). Correlation is covariance divided by the product of standard deviations, so it’s in [-1, 1].
The covariance matrix of a vector-valued random variable X of dimension d is the d × d matrix with (i, j) entry Cov[Xᵢ, Xⱼ]. The eigendecomposition of the covariance matrix is what principal component analysis (PCA) is built on. Rarely directly relevant to LLMs, but shows up in whitening, batch norm, and some retrieval techniques.
C.3 Information theory
C.3.1 Entropy
For a discrete distribution P(X), the entropy is
H(X) = -Σ P(x) log P(x)
(using the convention 0 log 0 = 0).
Units depend on the log base: bits for log₂, nats for natural log. In ML we usually use natural log, so entropy is in nats.
Intuitively, entropy measures uncertainty or average surprise. A uniform distribution over k outcomes has maximum entropy log k. A degenerate distribution (all probability on one outcome) has zero entropy.
For language, the entropy of a text distribution tells you how many bits per token a compression algorithm would need on average. Shannon’s 1951 experiments estimated the entropy of English at ~1 bit per character, which explains why text compresses so well.
C.3.2 Cross-entropy
For two distributions P and Q over the same support, the cross-entropy of Q with respect to P is
H(P, Q) = -Σ P(x) log Q(x)
Interpret it as “the expected number of bits to encode samples from P using a code designed for Q.” If Q = P, cross-entropy equals entropy. Otherwise, cross-entropy is larger.
In a language model, P is the true distribution (one-hot on the actual next token during training), and Q is the model’s predicted distribution (the softmax output). The cross-entropy loss is
L = -Σ P(token) log Q(token) = -log Q(true_token)
because P is a one-hot vector. The training objective is to minimize cross-entropy, which is the same as maximizing the log-likelihood of the true token under the model.
C.3.3 KL divergence
The Kullback-Leibler divergence from Q to P:
KL(P || Q) = Σ P(x) log (P(x) / Q(x)) = H(P, Q) - H(P)
That is, cross-entropy minus entropy. KL divergence is non-negative and zero iff P = Q.
Note the asymmetry: KL(P || Q) ≠ KL(Q || P) in general. This matters in ML:
- Forward KL, KL(P_data || P_model), is what you minimize in maximum likelihood training. It’s “zero-avoiding”: the model must put probability mass wherever the data has mass, or the KL explodes. Tends to produce over-dispersed models.
- Reverse KL, KL(P_model || P_data), is what shows up in variational inference and some RL objectives. It’s “zero-forcing”: the model only needs mass where the data has mass, producing mode-seeking behavior.
KL divergence is also the regularizer in RLHF: you add β · KL(π_policy || π_reference) to the reward to keep the policy from drifting too far from the SFT model, where β trades off reward and stability.
DPO connection: the DPO paper shows that the optimal policy under a KL-regularized reward maximization has a closed form, and you can train directly on preference pairs without ever fitting a reward model. The math is Bayes’ rule plus the KL regularizer, cleverly rearranged.
C.3.4 Mutual information
For two random variables X and Y:
I(X; Y) = KL(P(X, Y) || P(X) P(Y)) = H(X) + H(Y) - H(X, Y)
The KL divergence between the joint and the product of the marginals. Zero iff X and Y are independent. Measures how much knowing X tells you about Y.
Not directly used in LLM training, but shows up in interpretability, information bottleneck frameworks, and retrieval-evaluation arguments (“the embedding should preserve the mutual information between query and relevant document”).
C.3.5 A small worked example: entropy and cross-entropy
Suppose a vocabulary of three tokens {a, b, c} and the true next-token distribution P = (0.7, 0.2, 0.1). Entropy:
H(P) = -(0.7 log 0.7 + 0.2 log 0.2 + 0.1 log 0.1)
≈ -(0.7 · -0.357 + 0.2 · -1.609 + 0.1 · -2.303)
≈ -( -0.250 - 0.322 - 0.230 )
≈ 0.802 nats
A well-trained model might predict Q = (0.6, 0.3, 0.1). Cross-entropy:
H(P, Q) = -(0.7 log 0.6 + 0.2 log 0.3 + 0.1 log 0.1)
≈ -(0.7 · -0.511 + 0.2 · -1.204 + 0.1 · -2.303)
≈ 0.828 nats
KL divergence is the difference:
KL(P || Q) = 0.828 - 0.802 = 0.026 nats
Small, as expected for a close match. If the model were completely wrong — Q = (0.1, 0.3, 0.6) — the cross-entropy would be much higher:
H(P, Q) = -(0.7 log 0.1 + 0.2 log 0.3 + 0.1 log 0.6)
≈ -(-1.612 - 0.241 - 0.051)
≈ 1.904 nats
KL ≈ 1.1 nats. That’s the training signal: the model sees a loss of 1.9 per token instead of 0.8 and has to move Q toward P.
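These numbers are quick to reproduce (natural log, so nats):

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])        # true distribution
Q_good = np.array([0.6, 0.3, 0.1])   # close model
Q_bad = np.array([0.1, 0.3, 0.6])    # completely wrong model

H = lambda p: -np.sum(p * np.log(p))        # entropy
CE = lambda p, q: -np.sum(p * np.log(q))    # cross-entropy

assert np.isclose(H(P), 0.802, atol=1e-3)
assert np.isclose(CE(P, Q_good), 0.829, atol=1e-3)
# KL is cross-entropy minus entropy.
assert np.isclose(CE(P, Q_good) - H(P), 0.026, atol=2e-3)
assert np.isclose(CE(P, Q_bad), 1.904, atol=1e-3)
```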
C.3.6 Perplexity
Perplexity is the exponential of the cross-entropy (in the base of the log used):
Perplexity(P, Q) = exp(H(P, Q))
It has an interpretation as “effective number of equally likely next tokens the model is choosing between.” A perplexity of 20 means the model is as uncertain as if it were choosing uniformly among 20 tokens. Historically the main LM metric; now mostly a training diagnostic, because lower perplexity doesn’t necessarily mean better downstream task performance.
C.4 A few miscellaneous things
C.4.1 Softmax and the log-sum-exp trick
Softmax converts a vector of logits z into a probability distribution:
softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
Naively computing this overflows for large zᵢ because exp blows up. The standard trick: subtract the max before exponentiating:
softmax(z)ᵢ = exp(zᵢ - max(z)) / Σⱼ exp(zⱼ - max(z))
Mathematically identical (the max cancels), but now the largest exponent is zero and nothing overflows. The cross-entropy loss combines log and softmax into a single numerically stable operation:
log_softmax(z)ᵢ = zᵢ - (max(z) + log Σⱼ exp(zⱼ - max(z)))
This is the log-sum-exp trick. PyTorch’s F.cross_entropy does it for you. Never compute softmax then take the log yourself; that’s a gradient bug waiting to happen.
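A minimal sketch of the stable version, on logits that would overflow a naive softmax:

```python
import numpy as np

def log_softmax(z):
    # Subtract the max so the largest exponent is exactly zero.
    m = z.max()
    return z - (m + np.log(np.sum(np.exp(z - m))))

z = np.array([1000.0, 1001.0, 1002.0])  # naive exp(z) would overflow
lp = log_softmax(z)
assert np.isfinite(lp).all()
assert np.isclose(np.exp(lp).sum(), 1.0)  # still a valid distribution
```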
C.4.2 Gradients you should know
Scalar loss L, parameters W:
- Linear layer y = Wx: dL/dW = (dL/dy) xᵀ. Outer product.
- Linear layer, input gradient: dL/dx = Wᵀ (dL/dy).
- Softmax cross-entropy: dL/dz = softmax(z) - y_onehot. This is the elegant formula that makes training stable.
- ReLU: dL/dx = dL/dy if x > 0, else 0.
- LayerNorm: nonlinear in the input; PyTorch computes it for you.
The clean “softmax minus target” gradient at the output layer is one of the nice things about combining softmax and cross-entropy. That’s why they’re always implemented together in modern frameworks.
C.4.3 The chain rule made concrete
If y = f(x) and L = g(y), then
dL/dx = (dL/dy) · (dy/dx)
For multivariate versions, it’s the Jacobian matrix multiplication. Backprop is the chain rule applied mechanically through the computation graph. Autograd frameworks record each op’s local gradient and compose them.
C.4.4 The √d in attention, explained
Why divide by √d before the softmax in attention? Assume Q and K entries are independent, zero-mean, unit-variance. The dot product Q · K of two d-dimensional vectors is a sum of d products, each with variance ≈ 1, so the total variance is d. Standard deviation √d. As d grows, pre-softmax scores get larger, and softmax saturates (becomes near-one-hot), which kills gradients. Dividing by √d keeps the variance at 1 regardless of d, keeping the softmax in its useful regime.
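A simulation makes the argument concrete: the variance of the raw scores grows like d, and dividing by √d pins it at 1.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (16, 64, 256):
    # 10,000 dot products of independent zero-mean unit-variance vectors.
    q = rng.normal(size=(10_000, d))
    k = rng.normal(size=(10_000, d))
    scores = (q * k).sum(axis=1)
    # Raw variance is roughly d; scaled variance is roughly 1.
    assert abs(scores.var() / d - 1.0) < 0.1
    assert abs((scores / np.sqrt(d)).var() - 1.0) < 0.1
```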
C.4.5 Numerical precision in one paragraph
FP32 has ~7 decimal digits of precision. FP16 has ~3. BF16 has ~2-3 but the same exponent range as FP32. FP8 E4M3 has ~1 decimal digit. For matmuls, the loss of precision is usually tolerable because of averaging. For accumulations (loss, gradient clipping, layer norm statistics), you often need higher precision to avoid catastrophic cancellation or overflow. The practical rule: weights and gradients in low precision, but accumulate sums in higher precision.
C.5 A deeper linear algebra walkthrough: LoRA derived
Let’s actually derive LoRA from first principles, because this is one of the places where the abstract math pays off immediately.
Start with a pretrained linear layer y = Wx + b where W has shape (d_out, d_in). Fine-tuning means finding a new W' = W + ΔW that does better on a task. Full fine-tuning stores and updates the full ΔW, which has d_out × d_in parameters — for a 4096×4096 attention projection, 16.7M parameters per layer.
Claim 1: The fine-tuning delta ΔW is approximately low-rank. Why? A pretrained model has already learned most of the representation; the task-specific adjustment is small and lies in a low-dimensional subspace of parameter space. Empirically, the SVD of ΔW for fine-tuned models shows a sharp decay of singular values beyond the first few dozen.
Claim 2: If ΔW has rank r, it can be written as ΔW = BA where B has shape (d_out, r) and A has shape (r, d_in). This is SVD truncation in disguise: keep only the top r singular values, absorb the singular values into either factor.
Parameter count: Full fine-tune: d_out × d_in. LoRA: d_out × r + r × d_in = r × (d_out + d_in).
Example: d_out = d_in = 4096, r = 8. Full: 16,777,216 parameters. LoRA: 8 × 8192 = 65,536. Ratio: 256×.
Forward pass: y = (W + BA)x = Wx + BAx. Two matmuls: first compute Ax (shape r), then compute B(Ax) (shape d_out). Cost r × (d_in + d_out) per sample, which is small when r is small.
Initialization: A is initialized with a Gaussian, B is initialized to zero. This makes ΔW = BA = 0 at the start of training, so the initial model is exactly the pretrained one. Training then shapes A and B toward the task-specific delta.
Merging: After training, you can fold BA into W by computing W' = W + BA once and discarding the adapter — no inference overhead. The only cost is that you lose the modularity of keeping B and A separate (can’t hot-swap adapters).
This is the whole LoRA paper in math. The empirical contributions are showing that it works across many tasks, that r = 8 is enough for most things, and that it’s more memory-efficient than full fine-tuning. The math, once you see the rank-r factorization, is two lines.
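The derivation fits in a short NumPy sketch. The initialization and merge follow the text; the layer sizes are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, Gaussian init
B = np.zeros((d_out, r))               # trainable, zero init

x = rng.normal(size=(d_in,))
# Forward pass: two small matmuls instead of one dense delta.
y = W @ x + B @ (A @ x)
assert np.allclose(y, W @ x)  # B = 0, so ΔW = BA = 0 at the start

# Pretend training produced a nonzero B, then merge: W' = W + BA.
B = rng.normal(size=(d_out, r))
W_merged = W + B @ A
assert np.allclose(W_merged @ x, W @ x + B @ (A @ x))

# Adapter parameter count: r·(d_in + d_out), not d_out·d_in.
assert A.size + B.size == r * (d_in + d_out)
```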
C.6 The softmax + cross-entropy derivation
Another place where the math pays off: why the gradient of softmax-plus-cross-entropy is so clean.
Let z be the logit vector (length V, the vocabulary size), p = softmax(z) be the predicted probabilities, and y be the one-hot target vector (1 at the true token index t, 0 elsewhere).
Cross-entropy loss: L = -Σᵢ yᵢ log pᵢ = -log pₜ.
Compute ∂L/∂zⱼ. First:
pᵢ = exp(zᵢ) / Σₖ exp(zₖ)
Gradient of pᵢ with respect to zⱼ:
- If i = j: ∂pᵢ/∂zⱼ = pᵢ(1 - pᵢ).
- If i ≠ j: ∂pᵢ/∂zⱼ = -pᵢpⱼ.
This can be written compactly as ∂pᵢ/∂zⱼ = pᵢ(δᵢⱼ - pⱼ), where δᵢⱼ is 1 if i = j and 0 otherwise.
Now ∂L/∂zⱼ = ∂(-log pₜ)/∂zⱼ = -(1/pₜ) · ∂pₜ/∂zⱼ = -(1/pₜ) · pₜ(δₜⱼ - pⱼ) = -(δₜⱼ - pⱼ) = pⱼ - δₜⱼ.
That is: ∂L/∂z = p - y. The gradient of the loss with respect to the pre-softmax logits is simply (predicted probabilities) minus (target one-hot). Clean, numerically stable, and the direct training signal.
This is why LLM training frameworks fuse softmax and cross-entropy into a single op: the joint gradient is cheap and the intermediate p never needs to be written back to memory.
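The p - y formula is easy to verify against finite differences on a toy logit vector:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.0, -1.0])
t = 1  # true token index
loss = lambda z: -np.log(softmax(z)[t])

# Analytic gradient: p - y (subtract 1 at the true index).
grad = softmax(z).copy()
grad[t] -= 1.0

# Central finite-difference check, one coordinate at a time.
eps = 1e-6
for j in range(len(z)):
    dz = np.zeros_like(z); dz[j] = eps
    numeric = (loss(z + dz) - loss(z - dz)) / (2 * eps)
    assert abs(numeric - grad[j]) < 1e-5
```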
C.7 Probability drilldown: Bayes for LLMs
Most “probabilistic interpretation of LLMs” talks are Bayes in disguise. Make it concrete.
A language model assigns probability P(w_{1:n}) to a sequence of tokens. By the chain rule of probability (not the chain rule of calculus — different thing, same name):
P(w_{1:n}) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) · ... · P(w_n | w_1, ..., w_{n-1})
A causal LM factorizes the joint as a product of conditionals. Each conditional P(w_t | w_{<t}) is what the softmax output at position t gives you.
Training maximizes log P(w_{1:n}) over a corpus, which is equivalent to minimizing cross-entropy at each position. That’s the entire training objective.
At inference, you want to sample from P(w_{next} | w_{so far}). Temperature sampling scales the logits (dividing by T) before softmax, which sharpens (T < 1) or flattens (T > 1) the distribution. Top-k truncates to the k most likely tokens and renormalizes. Top-p truncates to the smallest set whose cumulative probability exceeds p.
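A minimal sketch of those three modifications on a toy logit vector; these are illustrative implementations, not any particular library's:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.0, -1.0])
p = softmax(z)

# Temperature: divide logits by T before softmax.
sharp = softmax(z / 0.5)  # T < 1 sharpens
flat = softmax(z / 2.0)   # T > 1 flattens
assert sharp[0] > p[0] > flat[0]

def top_k(p, k):
    """Keep the k most likely tokens, renormalize."""
    idx = np.argsort(-p)[:k]
    out = np.zeros_like(p)
    out[idx] = p[idx]
    return out / out.sum()

def top_p(p, thresh):
    """Keep the smallest set whose cumulative probability exceeds thresh."""
    order = np.argsort(-p)
    keep_n = int(np.searchsorted(np.cumsum(p[order]), thresh)) + 1
    out = np.zeros_like(p)
    out[order[:keep_n]] = p[order[:keep_n]]
    return out / out.sum()

assert np.count_nonzero(top_k(p, 2)) == 2
assert np.isclose(top_p(p, 0.9).sum(), 1.0)
```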
Bayes’ rule enters when you want to condition on something external. For RAG:
P(answer | question) = Σ_docs P(answer | question, docs) · P(docs | question)
The retriever approximates P(docs | question); the generator computes P(answer | question, docs). The full marginalization over documents doesn’t usually get implemented — you just concatenate the top retrieved docs into the prompt — but the mental model is useful.
For DPO, the derivation starts from a KL-regularized RL objective:
max_π E_{x~D, y~π(x)}[r(x,y)] - β · KL(π(·|x) || π_ref(·|x))
Using variational calculus (specifically, Lagrange multipliers applied to the KL term), the optimal policy has the form:
π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β)
where Z(x) is a normalizing constant. Take the log and solve for the reward:
r(x,y) = β · log(π*(y|x) / π_ref(y|x)) + β · log Z(x)
The second term only depends on x, not y, so it cancels in any pairwise comparison. The DPO loss replaces r in the preference likelihood (Bradley-Terry model) with this implicit reward, eliminating the need for a separate reward model. The machinery is pure Bayes plus variational calculus.
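The resulting pairwise loss can be sketched in a few lines; the function name and the toy log-probabilities are illustrative, not from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, via the implicit reward
    r = beta * log(pi / pi_ref); log Z(x) cancels in the difference."""
    r_chosen = beta * (logp_chosen - ref_chosen)
    r_rejected = beta * (logp_rejected - ref_rejected)
    # Bradley-Terry preference likelihood, negated log.
    return -np.log(sigmoid(r_chosen - r_rejected))

# If the policy prefers the chosen response more than the reference does,
# the loss drops below log 2 (the value at indifference).
loss = dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0,
                ref_chosen=-2.0, ref_rejected=-2.0)
assert loss < np.log(2)
```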
C.8 Information theory drilldown: why cross-entropy is the loss
You can train an LLM with many loss functions. Why cross-entropy specifically?
Three answers.
Answer 1: maximum likelihood. Cross-entropy between the empirical distribution (one-hot on the true token) and the model distribution is the negative log-likelihood of the data under the model. Minimizing it is the same as finding the model that makes the observed data most probable. This is a clean statistical principle.
Answer 2: information-theoretic. Cross-entropy measures the expected number of bits needed to encode samples from the true distribution P using a code optimized for the model’s distribution Q. It equals the entropy of P (an intrinsic property) plus the KL divergence from P to Q (the model’s “extra cost”). Minimizing cross-entropy minimizes the KL divergence, driving the model distribution toward the true one.
Answer 3: it has the right gradient. Recall ∂L/∂z = p - y. This is bounded (each component is in [-1, 1]), well-scaled, and points directly at the error direction. Competing losses like squared error on the softmax output have worse gradient properties (the gradient scales with p(1-p), which can vanish for confident predictions). Cross-entropy stays well-behaved throughout training.
All three answers point to the same thing. The fact that they agree is not a coincidence — it’s what makes the training recipe work across architectures.
C.9 Tiny worked problems
To check that this refresher actually stuck, try these. Answers at the end.
Problem 1. Let A be a 4×4 matrix of all ones. What is its rank?
Problem 2. You have a weight matrix of shape (4096, 4096) and want to represent its delta with a rank-16 LoRA. How many parameters does the adapter have?
Problem 3. A model outputs logits z = (2, 1, 0, -1). What are the softmax probabilities? What’s the cross-entropy loss if the true class is index 1?
Problem 4. A coin is biased: P(heads) = 0.7. What’s the entropy in bits?
Problem 5. Two distributions: P = (0.5, 0.5), Q = (0.9, 0.1). What’s KL(P || Q) in nats?
Problem 6. If a 70B model has 80 layers and a hidden dimension of 8192, what’s the rough size of one attention Q projection matrix in BF16?
Problem 7. What’s the variance of a dot product of two d-dimensional random vectors with independent unit-variance zero-mean components?
Answers:
1. Rank 1. All rows are identical, so only one independent row exists.
2. 16 × (4096 + 4096) = 131,072 parameters. Compared to 4096 × 4096 = 16.7M for a full delta, that’s 128× smaller.
3. exp(z) = (7.389, 2.718, 1.000, 0.368), sum = 11.475. So p ≈ (0.644, 0.237, 0.087, 0.032). Cross-entropy for class 1 is -log(0.237) ≈ 1.440 nats (≈ 2.08 bits).
4. H = -(0.7 log₂ 0.7 + 0.3 log₂ 0.3) ≈ -(0.7 · -0.515 + 0.3 · -1.737) ≈ 0.881 bits.
5. KL(P||Q) = 0.5 log(0.5/0.9) + 0.5 log(0.5/0.1) = 0.5 · -0.588 + 0.5 · 1.609 ≈ 0.511 nats.
6. Shape (8192, 8192). In BF16, 8192 × 8192 × 2 bytes = 134.2 MB per layer. Times 80 layers times 4 projection matrices (Q, K, V, output) — that’s ~43 GB just for the attention projections. Most of the ~140 GB of model weights comes from the FFNs, not attention, for non-GQA models.
7. Each component product Xᵢ · Yᵢ has variance 1. A sum of d independent such products has variance d, standard deviation √d. This is why attention divides by √dₖ.
C.10 The math you don’t need
Things that show up in ML textbooks but you can ignore for the rest of this book:
- Explicit calculus of variations. Needed for theoretical derivations (like the DPO one above), but you never compute it by hand in practice.
- Convex analysis. Nobody proves convergence for deep learning. The field works by empirical trial and error.
- Tensor calculus (Einstein notation, index gymnastics). Useful if you write CUDA kernels from scratch, unnecessary otherwise. PyTorch’s einsum gives you the notation without the hand-calculation.
- Measure theory. Needed for PhD-level probability theory. Irrelevant for ML engineering.
- Differential geometry. Shows up in some theoretical papers (natural gradients, Riemannian optimization). Skip.
If a chapter in this book or a paper you’re reading requires something beyond this appendix, it’s usually because a derivation is being shown for rhetorical effect. Read the words, not the symbols, and move on.
That’s it. Everything in the rest of the book assumes you can read these operations, not that you can derive them. If a chapter feels opaque, come back here and look up the specific operation. The entries are short on purpose.