Part X · ML System Design Interview Playbook
Chapter 124 · ~43 min read

The coding interview: twenty ML systems algorithms

"You can fake a system design. You cannot fake a working softmax."

Every Part X chapter so far has focused on the design-round format — clarify, estimate, design, drill, ops. But many loops at senior levels include a 45-minute coding round that sits next to the system design, and it tests a tighter scope: can you implement, from scratch and without documentation, the algorithms that make the system work? The ten minutes you spend sketching a KV cache in the design round will be followed, at some companies, by a coding round that says “now implement it.”

The twenty algorithms in this chapter are the ones that show up most often. They are not data-structure puzzles. They are ML primitives — the numerical routines, the serving logic, the retrieval plumbing — that a senior ML engineer should be able to write correctly under pressure. Getting them right demonstrates something that a whiteboard diagram never can: you understand the operation at the level of shapes, dtypes, numerical edge cases, and off-by-one errors.

The interviewers grading this round are not looking for elegance. They are looking for three things: correct behavior on the normal case, awareness of the numerical traps, and knowledge of what to check when the output looks wrong. A candidate who writes a slightly verbose but numerically stable softmax and immediately explains the max-subtract trick is a candidate who has shipped code. A candidate who writes the naive version and says “I think this is fine” is a candidate who has read slides.


§124.1 Numerically stable softmax

Problem statement. Implement softmax over a 1-D array of logits. Then implement log-softmax (for use in cross-entropy), and explain why numerical stability matters.

Why they’re asking. Softmax is the first filter. Every candidate writes it; the question is whether they write the version that explodes on large logits. The interviewer wants to see (a) the max-subtract trick stated before code is written, not discovered after the output is nan, and (b) the log-sum-exp formulation for computing log-probabilities without materializing the full softmax.

import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over last axis."""
    # Subtract max to prevent exp overflow (no change to output — divides out)
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(x: np.ndarray) -> np.ndarray:
    """log(softmax(x)) without materializing softmax — avoids log(0)."""
    x_max = x.max(axis=-1, keepdims=True)
    log_sum_exp = np.log(np.exp(x - x_max).sum(axis=-1, keepdims=True)) + x_max
    return x - log_sum_exp

# Quick sanity checks
logits = np.array([1.0, 2.0, 3.0])
probs  = softmax(logits)
assert abs(probs.sum() - 1.0) < 1e-6
assert np.allclose(np.log(probs), log_softmax(logits), atol=1e-6)

# Large logit case — naive exp would overflow to inf
large = np.array([1000.0, 1001.0, 1002.0])
assert not np.any(np.isnan(softmax(large)))

The common bugs.

  • Subtracting mean instead of max. Mean-subtract is not wrong per se but does not prevent overflow; max-subtract does.
  • Writing np.exp(x).sum() before x - x.max(). The subtraction must happen first.
  • Using log(softmax(x)) for cross-entropy loss instead of log_softmax. Squashing probabilities close to zero through log after softmax loses precision; log_softmax keeps it.

Stretch follow-ups.

  • “Now make this work on a batch of logits (shape [B, V]) without a loop.” The keepdims=True already handles it; note the candidate must confirm axes.
  • “What happens to softmax when two logits are equal and very large?” Numerically fine after max-subtract; still worth saying out loud.
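
The batch follow-up can be checked in a few lines. A minimal sketch, reusing the same stable softmax (the keepdims=True reductions are what make it batch-safe):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Same stable softmax as above; keepdims makes the reductions batch-safe.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Batch of logits [B, V]: each row is normalized independently.
batch = np.array([[1.0, 2.0, 3.0],
                  [1000.0, 1001.0, 1002.0]])   # second row overflows naively
probs = softmax(batch)
assert probs.shape == (2, 3)
assert np.allclose(probs.sum(axis=-1), 1.0)
# Shift-invariance: rows with the same relative logits give the same
# distribution, which is exactly why the max-subtract trick is safe.
assert np.allclose(probs[0], probs[1], atol=1e-6)
```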

§124.2 Scaled dot-product attention

Problem statement. Given query Q, key K, and value V matrices (shapes [seq, d_k], [seq, d_k], [seq, d_v]), implement scaled dot-product attention. Add an optional boolean causal mask.

Why they’re asking. This is the kernel of every transformer. The interviewer will let the candidate write it and then ask about shapes — wrong shapes are the most common failure. They will also probe the sqrt(d_k) denominator: can the candidate explain why it’s there (variance preservation under random initialization), or do they just copy it?

import numpy as np

def scaled_dot_product_attention(
    Q: np.ndarray,          # [seq_q, d_k]
    K: np.ndarray,          # [seq_k, d_k]
    V: np.ndarray,          # [seq_k, d_v]
    causal: bool = False,
) -> np.ndarray:            # [seq_q, d_v]
    d_k = Q.shape[-1]
    # Scores: [seq_q, seq_k]
    scores = Q @ K.T / np.sqrt(d_k)

    if causal:
        seq_q, seq_k = scores.shape
        # Upper-triangular mask (positions i should not attend to j > i)
        mask = np.triu(np.ones((seq_q, seq_k), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)

    weights = softmax(scores)   # [seq_q, seq_k]
    return weights @ V           # [seq_q, d_v]

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

The common bugs.

  • Transposing the wrong matrix: K @ Q.T produces the transposed score matrix [seq_k, seq_q]. It passes shape checks whenever seq_q == seq_k, so square test cases hide the bug.
  • Using k=0 in np.triu for the causal mask, which masks the diagonal and prevents each position from attending to itself.
  • Forgetting to apply softmax before the final matmul — a surprisingly common error under pressure.

Stretch follow-ups.

  • “What if seq_q != seq_k? When does that happen?” Cross-attention in encoder-decoder models; also prefill vs decode in KV cache.
  • “How does the mask change for padding tokens vs causal?” Padding masks zero out specific positions; the candidate should show np.where(pad_mask, -1e9, scores).
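
The padding follow-up is worth coding once, since the two masks compose by sequential application. A hedged sketch (the function name and shapes here are mine, not from the chapter):

```python
import numpy as np

def masked_scores(scores: np.ndarray, pad_mask: np.ndarray,
                  causal: bool = True) -> np.ndarray:
    """Apply a key-padding mask and an optional causal mask to raw scores.

    scores:   [seq_q, seq_k]
    pad_mask: [seq_k] bool, True where the key position is padding
    """
    seq_q, seq_k = scores.shape
    # Padding masks out the same key columns for every query row
    scores = np.where(pad_mask[None, :], -1e9, scores)
    if causal:
        tri = np.triu(np.ones((seq_q, seq_k), dtype=bool), k=1)
        scores = np.where(tri, -1e9, scores)
    return scores

scores = np.zeros((4, 4))
pad = np.array([False, False, False, True])   # last key token is padding
out = masked_scores(scores, pad)
assert (out[:, 3] == -1e9).all()   # pad column masked for every query
assert out[0, 1] == -1e9           # causal: position 0 cannot see position 1
assert out[0, 0] == 0.0            # diagonal survives (k=1 in triu)
```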

§124.3 Multi-head attention

Problem statement. Implement multi-head attention. You have weight matrices W_Q, W_K, W_V (each [d_model, d_model]) and W_O ([d_model, d_model]). Accept x of shape [seq, d_model].

Why they’re asking. Multi-head attention is where shape bugs flourish. The reshape-and-transpose dance ([seq, n_heads, d_k] → [n_heads, seq, d_k]) trips up everyone who has only seen it in high-level APIs. The interviewer wants to see the reshape done cleanly and the output concatenation done correctly.

import numpy as np

def multi_head_attention(
    x: np.ndarray,           # [seq, d_model]
    W_Q: np.ndarray,         # [d_model, d_model]
    W_K: np.ndarray,         # [d_model, d_model]
    W_V: np.ndarray,         # [d_model, d_model]
    W_O: np.ndarray,         # [d_model, d_model]
    n_heads: int,
    causal: bool = False,
) -> np.ndarray:             # [seq, d_model]
    seq, d_model = x.shape
    d_k = d_model // n_heads
    assert d_model % n_heads == 0

    # Project
    Q = x @ W_Q              # [seq, d_model]
    K = x @ W_K
    V = x @ W_V

    # Split into heads: [seq, d_model] -> [n_heads, seq, d_k]
    def split_heads(t):
        t = t.reshape(seq, n_heads, d_k)    # [seq, n_heads, d_k]
        return t.transpose(1, 0, 2)          # [n_heads, seq, d_k]

    Q_h, K_h, V_h = split_heads(Q), split_heads(K), split_heads(V)

    # Attend per head
    heads = []
    for h in range(n_heads):
        heads.append(scaled_dot_product_attention(Q_h[h], K_h[h], V_h[h], causal))

    # Concat heads: [n_heads, seq, d_k] -> [seq, d_model]
    out = np.concatenate(heads, axis=-1)  # [seq, d_model]
    return out @ W_O

def scaled_dot_product_attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)
    return w @ V

The common bugs.

  • Transposing to [seq, n_heads, d_k] and iterating over the wrong axis — always transpose to [n_heads, seq, d_k] before iterating.
  • Concatenating on axis 0 (heads axis) instead of axis -1 (feature axis).
  • Forgetting W_O — the output projection is not optional; without it, heads cannot interact.

Stretch follow-ups.

  • “Vectorize away the head loop with a batched matmul.” The candidate should sketch np.einsum or batched @ over the [n_heads, seq, d_k] tensor.
  • “What is GQA?” Multiple query heads share one key-value head — K_h and V_h are replicated across a group of query heads.
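
The vectorization follow-up can be sketched with einsum. This is one way to batch the head loop, not the only formulation; the loop comparison at the bottom confirms it is behavior-preserving:

```python
import numpy as np

def attention_all_heads(Q_h: np.ndarray, K_h: np.ndarray,
                        V_h: np.ndarray) -> np.ndarray:
    """Batched attention over [n_heads, seq, d_k] with no Python loop."""
    d_k = Q_h.shape[-1]
    # [n_heads, seq_q, seq_k]: contract over d_k independently per head
    scores = np.einsum("hqd,hkd->hqk", Q_h, K_h) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,hkd->hqd", w, V_h)   # [n_heads, seq, d_v]

rng = np.random.default_rng(0)
n_heads, seq, d_k = 4, 5, 8
Q = rng.normal(size=(n_heads, seq, d_k))
K = rng.normal(size=(n_heads, seq, d_k))
V = rng.normal(size=(n_heads, seq, d_k))

batched = attention_all_heads(Q, K, V)
for h in range(n_heads):             # must match the per-head loop exactly
    s = Q[h] @ K[h].T / np.sqrt(d_k)
    s = s - s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w = w / w.sum(axis=-1, keepdims=True)
    assert np.allclose(batched[h], w @ V[h])
```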

§124.4 BPE tokenizer training loop

Problem statement. Implement the training loop for Byte-Pair Encoding (BPE). Input: a list of pre-tokenized strings (already split on whitespace). Target: a merge table of num_merges merges.

Why they’re asking. BPE training is a classic pair-frequency problem. It exposes whether the candidate can maintain a correct frequency count under iterative merging without restarting from scratch each round. The weak answer rebuilds the frequency table from scratch every merge; the strong answer updates it incrementally.

from collections import defaultdict
from typing import List, Tuple, Dict

def train_bpe(
    corpus: List[str],      # ["hello world", "world peace", ...]
    num_merges: int,
) -> List[Tuple[str, str]]:
    """Return the list of (a, b) merge rules in order of application."""
    # Represent each word as a tuple of characters + end-of-word marker
    vocab: Dict[Tuple[str, ...], int] = defaultdict(int)
    for sentence in corpus:
        for word in sentence.split():
            vocab[tuple(word) + ("</w>",)] += 1

    merges: List[Tuple[str, str]] = []

    for _ in range(num_merges):
        # Count all adjacent pairs
        pair_freq: Dict[Tuple[str, str], int] = defaultdict(int)
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pair_freq[(a, b)] += freq

        if not pair_freq:
            break

        # Pick the most frequent pair
        best = max(pair_freq, key=pair_freq.get)
        merges.append(best)

        # Merge every occurrence of best in vocab
        new_vocab: Dict[Tuple[str, ...], int] = {}
        a, b = best
        for word, freq in vocab.items():
            new_word = []
            i = 0
            while i < len(word):
                if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                    new_word.append(a + b)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_vocab[tuple(new_word)] = freq
        vocab = new_vocab

    return merges

The common bugs.

  • Iterating over characters instead of symbols — after merges, symbols are multi-character; iterating over characters restarts from scratch.
  • Off-by-one in the pair count: a manual index loop over range(len(word)) either reads past the end or drops the final pair. zip(word, word[1:]) is correct by construction; range(len(word) - 1) is the safe manual alternative.
  • Rebuilding pair_freq by scanning the original string corpus instead of the current vocab representation — this ignores all previous merges.

Stretch follow-ups.

  • “How does inference-time encoding (encoding a new string with a trained BPE model) work?” Apply merges in order; the longest-match greedy approach is the standard.
  • “Why does tiktoken use a regex pre-tokenizer before BPE?” To prevent merges across word boundaries, digits, and punctuation — producing more predictable token boundaries.
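
Inference-time encoding from the first follow-up can be sketched directly against the merge-table format that train_bpe above returns: replay the learned merges in training order over a fresh word. (Production tokenizers speed this up; the replay version is the reference semantics.)

```python
from typing import List, Tuple

def bpe_encode(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Encode one word by replaying trained merges in training order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:              # order matters: earliest merge first
        i, out = 0, []
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# With merges learned in the order ("l", "l") then ("ll", "o"):
assert bpe_encode("hello", [("l", "l"), ("ll", "o")]) == ["h", "e", "llo", "</w>"]
```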

§124.5 Top-k / top-p (nucleus) sampling

Problem statement. Given a 1-D array of logits, implement a function that returns a sampled token index using top-k filtering, top-p (nucleus) filtering, and temperature — all in one function, applied in that order.

Why they’re asking. Sampling is the last step before text appears, and every combination of flags produces a different distribution. The interviewer wants to see: (a) temperature applied before masking, (b) top-k mask applied before top-p, (c) correct handling of the edge case where top-p cuts out the only remaining token.

import numpy as np

def sample_token(
    logits: np.ndarray,     # [vocab_size]
    temperature: float = 1.0,
    top_k: int = 0,         # 0 = disabled
    top_p: float = 1.0,     # 1.0 = disabled
) -> int:
    # 1. Temperature scaling
    logits = logits / max(temperature, 1e-8)

    # 2. Top-k: zero out all but the k largest logits
    if top_k > 0:
        k = min(top_k, len(logits))
        threshold = np.sort(logits)[-k]
        logits = np.where(logits < threshold, -1e9, logits)

    # 3. Convert to probabilities
    probs = softmax(logits)

    # 4. Top-p (nucleus): find the smallest set of tokens
    #    whose cumulative probability >= top_p
    if top_p < 1.0:
        sorted_idx   = np.argsort(probs)[::-1]        # highest first
        sorted_probs = probs[sorted_idx]
        cumulative   = np.cumsum(sorted_probs)
        # cutoff = index of the first token whose cumulative mass exceeds top_p;
        # keeping tokens [0 .. cutoff] guarantees at least one token survives
        cutoff = np.searchsorted(cumulative, top_p, side="right")
        removed = sorted_idx[cutoff + 1:]
        probs[removed] = 0.0
        probs /= probs.sum()                           # renormalize

    # 5. Sample
    return int(np.random.choice(len(probs), p=probs))

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

The common bugs.

  • Applying top-p before converting to probabilities — top-p is defined in probability space, not logit space.
  • Off-by-one in the nucleus cutoff: cumulative > top_p removes the token that pushes cumulative over top_p, but that token should be included.
  • Division by zero if all probs are zeroed out by aggressive top-k + top-p combination — always keep at least one token.

Stretch follow-ups.

  • “What is min-p sampling?” Remove any token with probability less than min_p * max_prob. Simpler than top-p and avoids the cumsum.
  • “Why is temperature 0 not truly deterministic in practice?” Hardware non-determinism in floating-point reduction order; argmax (greedy) is the true deterministic baseline.
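
Min-p from the first follow-up is short enough to write out. A sketch under the definition above, with the threshold computed as min_p times the maximum probability:

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float) -> np.ndarray:
    """Zero out tokens with prob < min_p * max_prob, then renormalize."""
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()   # the max-prob token always survives

probs = np.array([0.5, 0.3, 0.15, 0.05])
out = min_p_filter(probs, min_p=0.5)   # threshold = 0.5 * 0.5 = 0.25
assert np.allclose(out, [0.625, 0.375, 0.0, 0.0])
```

Note there is no cumsum and no sort: the filter adapts to the shape of the distribution with a single comparison.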

§124.6 Beam search

Problem statement. Implement beam search. Given a function score_fn(token_ids) -> np.ndarray[vocab_size] that returns log-probabilities for the next token, produce the top-beam_width complete sequences up to max_len tokens.

Why they’re asking. Beam search is the canonical dynamic-programming decoding algorithm. The interviewer wants to see correct log-prob accumulation (addition, not multiplication), proper termination on EOS, and no hidden bugs in beam expansion order.

import numpy as np
from typing import Callable, List, Tuple

EOS = 1  # assume token ID 1 is end-of-sequence

def beam_search(
    score_fn: Callable,     # score_fn(list[int]) -> np.ndarray[vocab_size] (log-probs)
    beam_width: int,
    max_len: int,
    vocab_size: int,
    bos_id: int = 0,
) -> List[Tuple[float, List[int]]]:
    """Return list of (score, token_ids) sorted best first."""
    # Each beam: (cumulative_log_prob, token_ids)
    beams: List[Tuple[float, List[int]]] = [(0.0, [bos_id])]
    completed: List[Tuple[float, List[int]]] = []

    for _ in range(max_len):
        candidates: List[Tuple[float, List[int]]] = []

        for log_prob, tokens in beams:
            if tokens[-1] == EOS:
                completed.append((log_prob, tokens))
                continue

            # Get log-probs for next token
            next_log_probs = score_fn(tokens)        # [vocab_size]

            # Expand: pick top beam_width tokens to keep search tractable
            top_k = min(beam_width, vocab_size)
            top_ids = np.argsort(next_log_probs)[-top_k:]
            for t in top_ids:
                candidates.append((
                    log_prob + next_log_probs[t],
                    tokens + [int(t)],
                ))

        if not candidates:
            break

        # Keep only the best beam_width live beams
        candidates.sort(key=lambda x: x[0], reverse=True)
        beams = candidates[:beam_width]

    # Collect any beams that did not hit EOS
    completed.extend(beams)
    completed.sort(key=lambda x: x[0], reverse=True)
    return completed[:beam_width]
[Figure: beam search tree with beam width 2. From BOS, first-step candidates A (−0.3), B (−0.7), C (−1.4), D (−2.1); C and D are pruned. Only the top-2 cumulative log-probability paths survive each subsequent step, and log-probs accumulate additively.]
At each step beam search keeps only the top-B cumulative log-probability paths — pruning is the entire point; the best final sequence need not pass through the best single-step token.

The common bugs.

  • Accumulating probabilities multiplicatively instead of log-probs additively — multiplicative scores underflow to zero within a few tokens.
  • Treating a beam that reached EOS as still live — it should move to the completed list, not be expanded.
  • Skipping the truncation step: each round expands the live beams into up to beam_width × vocab_size candidates, and keeping them all makes the frontier grow geometrically. The sort + truncate back to beam_width after every step is mandatory.

Stretch follow-ups.

  • “Why did beam search fall out of favor for LLMs?” It degrades with large models because the highest-probability sequences tend to be degenerate repetitions; sampling with temperature works better in practice.
  • “What is length normalization?” Divide cumulative log-prob by sequence length raised to a penalty exponent to prevent short sequences from dominating.
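
Length normalization can be made concrete with the GNMT-style penalty; the constants 5 and 6 and alpha = 0.6 are the commonly cited defaults, not values from this chapter:

```python
def length_normalized(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """GNMT-style penalty: divide cumulative log-prob by ((5 + L) / 6) ** alpha."""
    return log_prob / (((5 + length) / 6) ** alpha)

# Raw scores always favor the shorter beam; normalization can flip the ranking.
short = length_normalized(-2.0, length=3)    # raw score -2.0
long_ = length_normalized(-3.0, length=12)   # raw score -3.0
assert long_ > short                         # the longer beam now wins
```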

§124.7 KV cache

Problem statement. Implement an append-only KV cache for single-token decoding. The cache holds previously computed keys and values. On each decode step, append the new key and value, then run attention over cache + current query.

Why they’re asking. The KV cache is the primary memory structure in autoregressive inference. The interviewer is checking: (a) the candidate knows it is append-only (no recomputation), (b) the candidate correctly handles the query being length-1 while keys/values grow, and (c) the candidate knows the cache’s memory cost in bytes.

import numpy as np
from typing import Optional

class KVCache:
    """Single-layer, single-head KV cache for autoregressive decoding."""

    def __init__(self, d_k: int, d_v: int, max_seq: int):
        # Pre-allocate; avoids repeated reallocation
        self.K = np.zeros((max_seq, d_k), dtype=np.float32)
        self.V = np.zeros((max_seq, d_v), dtype=np.float32)
        self.length = 0

    def append(self, k: np.ndarray, v: np.ndarray):
        """k, v shapes: [d_k], [d_v] — single token."""
        assert self.length < self.K.shape[0], "cache full"
        self.K[self.length] = k
        self.V[self.length] = v
        self.length += 1

    def attend(self, q: np.ndarray) -> np.ndarray:
        """q shape: [d_k]. Returns context vector [d_v]."""
        K = self.K[:self.length]           # [seq_so_far, d_k]
        V = self.V[:self.length]           # [seq_so_far, d_v]
        scores = K @ q / np.sqrt(q.shape[-1])  # [seq_so_far]
        scores = scores - scores.max()
        w = np.exp(scores)
        w /= w.sum()                       # softmax weights
        return w @ V                       # [d_v]

def decode_step(
    x_embed: np.ndarray,    # [d_model] — embedding of the new token
    W_Q, W_K, W_V,          # [d_model, d_model] each
    cache: KVCache,
) -> np.ndarray:            # [d_v] — context for this step
    q = x_embed @ W_Q       # [d_k]
    k = x_embed @ W_K
    v = x_embed @ W_V
    cache.append(k, v)
    return cache.attend(q)

The common bugs.

  • Running attention over only the current token’s K/V instead of the full cache — this negates the point of the cache.
  • Growing the cache as a Python list and converting to numpy each step — pre-allocation with a length counter is O(1) per append.
  • Forgetting that the causal mask is trivially satisfied at decode time: every previously cached position is already in the past.

Stretch follow-ups.

  • “What is PagedAttention?” It replaces the contiguous pre-allocated buffer with block-paged storage so KV memory can be shared across requests and reclaimed on completion.
  • “How does the cache scale with batch size and sequence length?” Bytes = batch × seq × n_layers × 2 × d_model × sizeof(dtype). For a 70B model: 80 layers × 2 × 8192 × 2 bytes (bf16) × seq × batch — quickly gigabytes.
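
The scaling follow-up is worth doing as arithmetic rather than hand-waving. A sketch using illustrative 70B-class numbers (80 layers, d_model 8192, bf16); note it ignores GQA, which shrinks the real cache by a factor of n_heads / n_kv_heads:

```python
def kv_cache_bytes(batch: int, seq: int, n_layers: int, d_model: int,
                   dtype_bytes: int = 2) -> int:
    # the factor of 2 accounts for storing both K and V
    return batch * seq * n_layers * 2 * d_model * dtype_bytes

# One request at an 8k context with 70B-class dimensions:
gib = kv_cache_bytes(batch=1, seq=8192, n_layers=80, d_model=8192) / 2**30
assert gib == 20.0   # 20 GiB for a single request, before GQA savings
```

Twenty gigabytes per 8k-token request is why PagedAttention and GQA exist: the cache, not the weights, is what limits batch size at long contexts.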

§124.8 LoRA forward pass

Problem statement. Implement the LoRA forward pass. A frozen weight matrix W ([out_dim, in_dim]) is adapted by two low-rank matrices A ([r, in_dim]) and B ([out_dim, r]). Scaling factor is alpha / r. Input x is [in_dim].

Why they’re asking. LoRA is the dominant fine-tuning method in production. The interviewer wants to verify: (a) the candidate can write the additive residual correctly, (b) scaling by alpha/r is applied to the low-rank term (not W), and (c) the candidate knows why B is initialized to zeros at the start of training.

import numpy as np

class LoRALinear:
    """LoRA-adapted linear layer. W is frozen; A and B are trained."""

    def __init__(
        self,
        W: np.ndarray,      # [out_dim, in_dim], frozen
        r: int,             # rank
        alpha: float,       # scaling hyperparameter
    ):
        out_dim, in_dim = W.shape
        self.W = W
        self.r = r
        self.scale = alpha / r
        # Standard LoRA init: A ~ N(0, 1/r), B = 0
        self.A = np.random.randn(r, in_dim) / np.sqrt(r)   # [r, in_dim]
        self.B = np.zeros((out_dim, r))                     # [out_dim, r]

    def forward(self, x: np.ndarray) -> np.ndarray:
        """x: [in_dim] -> [out_dim]."""
        base   = self.W @ x                          # frozen path
        lora   = self.B @ (self.A @ x) * self.scale  # low-rank path
        return base + lora

# Sanity: at init, B=0 means LoRA contributes nothing — the adapted model
# starts identical to the base model.
W = np.random.randn(64, 32)
layer = LoRALinear(W, r=4, alpha=16.0)
x = np.random.randn(32)
base_out = W @ x
lora_out = layer.forward(x)
assert np.allclose(base_out, lora_out), "at init, LoRA must be a no-op"

The common bugs.

  • Scaling the wrong term: (W + scale * B @ A) @ x is correct algebra but mixes frozen and trained parameters; keep B @ A @ x as a separate term so W stays frozen.
  • Initializing B randomly — B must be zero at init so the LoRA delta starts at zero. A non-zero B would immediately shift the model away from the base checkpoint.
  • Applying scale outside the forward pass (e.g., at merge time only) without accounting for it during training — loss gradients will be miscalibrated.

Stretch follow-ups.

  • “How do you merge LoRA back into W for deployment?” W_merged = W + scale * B @ A. Zero deployment overhead after merge.
  • “What is DoRA?” Decomposes W into magnitude and direction; updates direction with LoRA and magnitude with a scalar — better learning dynamics on some tasks.
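
The merge follow-up reduces to a three-line identity check. A self-contained sketch (the dimensions and the nonzero B are illustrative; in a real merge, B comes from training):

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim, r, alpha = 64, 32, 4, 16.0
scale = alpha / r

W = rng.normal(size=(out_dim, in_dim))         # frozen base weight
A = rng.normal(size=(r, in_dim)) / np.sqrt(r)
B = rng.normal(size=(out_dim, r))              # pretend training moved B off zero

# Merge: fold the low-rank delta into W once, then serve a plain linear layer.
W_merged = W + scale * (B @ A)

x = rng.normal(size=in_dim)
adapted = W @ x + scale * (B @ (A @ x))        # LoRA forward, as above
assert np.allclose(W_merged @ x, adapted)      # zero overhead after merge
```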

§124.9 Cross-entropy loss with label smoothing

Problem statement. Implement cross-entropy loss from scratch. Add an optional label-smoothing parameter epsilon.

Why they’re asking. Cross-entropy is the loss function for language modeling. The interview checks: (a) the candidate uses log_softmax, not log(softmax), (b) label smoothing is applied correctly (distribute epsilon across all classes uniformly), and (c) the candidate can explain why label smoothing helps calibration.

import numpy as np

def cross_entropy(
    logits: np.ndarray,     # [vocab_size] or [batch, vocab_size]
    targets: np.ndarray,    # int scalar or [batch] of class indices
    epsilon: float = 0.0,   # label smoothing in [0, 1)
) -> float:
    """Cross-entropy loss with optional label smoothing."""
    single = logits.ndim == 1
    if single:
        logits  = logits[None]   # [1, V]
        targets = np.array([targets])

    batch, V = logits.shape
    log_probs = log_softmax(logits)  # [batch, V]

    if epsilon == 0.0:
        # Hard labels: pick the log-prob of the target class
        loss = -log_probs[np.arange(batch), targets]
    else:
        # Smooth labels: mix target with uniform distribution
        # true target gets (1 - epsilon), every class gets epsilon / V
        smooth = np.full_like(log_probs, epsilon / V)
        smooth[np.arange(batch), targets] += (1.0 - epsilon)
        loss = -(smooth * log_probs).sum(axis=-1)

    return float(loss.mean()) if not single else float(loss[0])

def log_softmax(x):
    x_max = x.max(axis=-1, keepdims=True)
    lse = np.log(np.exp(x - x_max).sum(axis=-1, keepdims=True)) + x_max
    return x - lse

The common bugs.

  • Using np.log(softmax(logits)) instead of log_softmax — precision loss when probabilities are very small.
  • Label smoothing whose target distribution does not sum to 1. In the correct version the target class gets (1 - epsilon) + epsilon/V and every other class gets epsilon/V, which sums to exactly 1; dropping the target's epsilon/V share (or adding (1 - epsilon) on top of a one-hot vector) breaks this.
  • Averaging over the vocabulary dimension before summing labels — the dot product (smooth * log_probs).sum(axis=-1) is correct; .mean(axis=-1) is not.

Stretch follow-ups.

  • “Why does label smoothing improve calibration?” The model can no longer drive the target logit to +∞ without incurring loss from the smoothed non-target classes; confidence is bounded.
  • “How does temperature scaling (post-hoc calibration) differ from label smoothing?” Temperature scales logits at inference time to adjust confidence; label smoothing regularizes during training.
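
The contrast in the second follow-up is easy to demonstrate: temperature scaling reshapes confidence at inference without changing the argmax. A sketch (in practice T is fitted on a validation set, which is omitted here):

```python
import numpy as np

def temperature_scale(logits: np.ndarray, T: float) -> np.ndarray:
    """Post-hoc calibration: divide logits by a scalar T before softmax."""
    z = logits / T
    z = z - z.max()            # stable softmax
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.0])
sharp = temperature_scale(logits, T=1.0)   # the model's raw confidence
soft  = temperature_scale(logits, T=2.0)   # T > 1 lowers confidence
assert soft.argmax() == sharp.argmax()     # ranking is unchanged
assert soft.max() < sharp.max()            # peak probability drops
```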

§124.10 RoPE rotation

Problem statement. Implement Rotary Position Embedding (RoPE). Precompute the cosine/sine rotation matrices for a given maximum sequence length and head dimension, then apply them to a query or key tensor.

Why they’re asking. RoPE is the position encoding in every modern LLM (Llama, Mistral, Qwen). The interviewer wants to see: (a) correct precomputation of θ frequencies, (b) the rotation applied pair-wise over the head dimension, and (c) the candidate noting that RoPE only touches Q and K, never V.

import numpy as np

def precompute_rope(d: int, max_seq: int, base: float = 10000.0):
    """Precompute cos and sin matrices for RoPE.
    Returns cos, sin each of shape [max_seq, d//2].
    """
    # Frequencies: theta_i = 1 / base^(2i / d) for i in [0, d/2)
    i = np.arange(d // 2, dtype=np.float32)
    theta = 1.0 / (base ** (2 * i / d))          # [d//2]

    positions = np.arange(max_seq, dtype=np.float32)  # [max_seq]
    angles = np.outer(positions, theta)            # [max_seq, d//2]
    return np.cos(angles), np.sin(angles)

def apply_rope(
    x: np.ndarray,          # [seq, d]  (Q or K)
    cos: np.ndarray,        # [max_seq, d//2]
    sin: np.ndarray,        # [max_seq, d//2]
    offset: int = 0,        # position offset for decode-time cache
) -> np.ndarray:
    seq, d = x.shape
    assert d % 2 == 0
    # Split into pairs: x1 = even dims, x2 = odd dims
    x1, x2 = x[:, :d//2], x[:, d//2:]
    c = cos[offset : offset + seq]   # [seq, d//2]
    s = sin[offset : offset + seq]
    # Rotate: [x1, x2] -> [x1*cos - x2*sin, x1*sin + x2*cos]
    return np.concatenate([x1 * c - x2 * s, x1 * s + x2 * c], axis=-1)

# RoPE is only applied to Q and K, never to V.
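
A useful sanity check to say out loud: after RoPE, the query-key dot product depends only on the relative offset m − n, never on absolute position. A self-contained check using the same contiguous-half convention as apply_rope above:

```python
import numpy as np

def rope(x: np.ndarray, pos: float, d: int, base: float = 10000.0) -> np.ndarray:
    """Minimal single-vector RoPE, contiguous-half convention."""
    i = np.arange(d // 2)
    theta = 1.0 / base ** (2 * i / d)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[: d // 2], x[d // 2:]
    return np.concatenate([x1 * c - x2 * s, x1 * s + x2 * c])

d = 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

# Same relative offset (3) at very different absolute positions -> same score
dot_a = rope(q, 5, d) @ rope(k, 2, d)
dot_b = rope(q, 105, d) @ rope(k, 102, d)
assert np.isclose(dot_a, dot_b)

# Isolating one frequency shows the score is literally cos(m - n)
e = np.zeros(d)
e[0] = 1.0
assert np.isclose(rope(e, 5, d) @ rope(e, 2, d), np.cos(3.0))
```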

The common bugs.

  • Applying RoPE to V — rotations encode position into Q and K; V carries content and should not be rotated.
  • Splitting the head dimension as [0::2] and [1::2] (interleaved) instead of [:d//2] and [d//2:] (contiguous halves). The math is the same; the layout differs. Match the convention used in the checkpoint.
  • Forgetting the offset parameter in decode-time inference — position IDs must reflect the token’s actual position in the sequence, not its position within the current forward pass.

Stretch follow-ups.

  • “What is YaRN / LongRoPE?” Extended-context techniques that interpolate or re-scale the θ frequencies to allow longer sequences than the training context without full re-training.
  • “What breaks if you rotate in bfloat16?” Small angles lose precision for positions near zero; warm-up with fp32 and cast down, or keep the cos/sin table in fp32.

§124.11 Layer normalization (RMSNorm)

Problem statement. Implement RMSNorm — the variant of layer normalization used in Llama, Mistral, and most modern decoder-only models. Accept a learned scale vector gamma.

Why they’re asking. The distinction between LayerNorm and RMSNorm matters: RMSNorm drops the mean-centering and bias terms, reducing compute. Interviewers ask it to see if the candidate knows why the mean is dropped (empirically, centering adds cost without accuracy gain in large models) and whether they can derive the formula from definition.

import numpy as np

def rms_norm(
    x: np.ndarray,          # [..., d_model]
    gamma: np.ndarray,      # [d_model], learned scale
    eps: float = 1e-6,
) -> np.ndarray:
    """RMSNorm: normalize by root-mean-square, then scale by gamma."""
    # RMS over the last dimension
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

def layer_norm(
    x: np.ndarray,          # [..., d_model]
    gamma: np.ndarray,      # [d_model]
    beta: np.ndarray,       # [d_model], bias (not in RMSNorm)
    eps: float = 1e-5,
) -> np.ndarray:
    """Standard LayerNorm for reference."""
    mean = x.mean(axis=-1, keepdims=True)
    var  = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
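
One sanity check ties the two definitions together: on zero-mean input with beta = 0 and matching eps, LayerNorm reduces to RMSNorm, because var(x) equals mean(x**2) when mean(x) is zero. A minimal, self-contained check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 16))
x = x - x.mean(axis=-1, keepdims=True)   # force zero mean per row
eps = 1e-5

# RMSNorm with gamma = 1
rms_out = x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

# LayerNorm with gamma = 1, beta = 0
mean = x.mean(axis=-1, keepdims=True)
var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
ln_out = (x - mean) / np.sqrt(var + eps)

assert np.allclose(rms_out, ln_out)
```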

The common bugs.

  • Putting eps in the wrong place: the guard belongs inside the square root — np.sqrt((x ** 2).mean(...) + eps). Writing np.sqrt((x ** 2).mean(...)) + eps instead computes rms + eps, which perturbs every norm by eps and still lets the gradient of sqrt blow up at zero input.
  • Applying gamma as a matrix multiply instead of element-wise: it’s a scale, not a projection.
  • Forgetting to set eps — without it, a zero-activation vector causes division by zero.

Stretch follow-ups.

  • “Why does pre-norm (norm before attention) train more stably than post-norm?” Gradient scale is bounded at each residual branch entry; post-norm can cause very deep models to have vanishing gradients in early layers.
  • “What is the memory cost of storing norms vs activations in the backward pass?” Norms need to store the pre-norm activations for the backward pass — same cost as any activation checkpoint.

§124.12 Adam / AdamW optimizer step

Problem statement. Implement a single Adam optimizer step from scratch. Then extend it to AdamW (decoupled weight decay).

Why they’re asking. Optimizer state is the primary memory cost in training after the model weights. The interviewer wants to see: (a) the bias-correction terms, (b) the eps in the right place (denominator, not inside the square root), and (c) why AdamW differs from L2 regularization.

import numpy as np
from typing import Dict

class AdamW:
    """AdamW optimizer — one parameter group."""

    def __init__(
        self,
        lr: float = 1e-3,
        beta1: float = 0.9,
        beta2: float = 0.999,
        eps: float = 1e-8,
        weight_decay: float = 0.01,
    ):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.wd = weight_decay
        self.m: Dict[str, np.ndarray] = {}   # first moment
        self.v: Dict[str, np.ndarray] = {}   # second moment
        self.t = 0

    def step(self, params: Dict[str, np.ndarray], grads: Dict[str, np.ndarray]):
        self.t += 1
        b1, b2 = self.beta1, self.beta2

        for name, p in params.items():
            g = grads[name]
            if name not in self.m:
                self.m[name] = np.zeros_like(p)
                self.v[name] = np.zeros_like(p)

            # Update biased first and second moment estimates
            self.m[name] = b1 * self.m[name] + (1 - b1) * g
            self.v[name] = b2 * self.v[name] + (1 - b2) * g ** 2

            # Bias-corrected estimates
            m_hat = self.m[name] / (1 - b1 ** self.t)
            v_hat = self.v[name] / (1 - b2 ** self.t)

            # AdamW: apply weight decay directly to weights (not via gradient)
            p -= self.lr * self.wd * p

            # Adam update
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

The common bugs.

  • Missing bias correction — in early steps, m and v are both biased toward zero. Since v sits in the denominator, the uncorrected update divides by a too-small second moment and takes dramatically oversized early steps.
  • Putting eps inside the square root: np.sqrt(v_hat + eps) instead of np.sqrt(v_hat) + eps. The two are numerically different; the PyTorch convention is eps outside the root.
  • Implementing L2 regularization (adding wd * p to the gradient) instead of AdamW (applying wd directly to the weight). L2 interacts with the adaptive step size; AdamW does not.
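
The bias-correction bullet can be checked numerically. An illustrative single-tensor step (same formulas as the class above) shows that at t = 1, m_hat = g and v_hat = g², so the first update is roughly lr · sign(g) regardless of gradient magnitude:

```python
import numpy as np

def adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    return p - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

p = np.array([1.0, 1.0])
g = np.array([1e-4, 1e4])                 # one tiny gradient, one huge one
p1, m, v = adam_step(p, g, np.zeros(2), np.zeros(2), t=1)
print(p - p1)                             # both entries ~1e-3 (= lr)
```

Dropping the two correction lines makes the first step ~3× larger (1 / sqrt(1 − beta2) ≈ 31.6 vs 1 / (1 − beta1) = 10), which is exactly the instability the bullet describes.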

Stretch follow-ups.

  • “What is the memory cost of Adam vs SGD for a 70B model?” Adam stores m and v — 2× the parameter count in extra state. For 70B fp32 parameters: 70B × 2 × 4 bytes ≈ 560 GB of optimizer state alone (≈ 840 GB including the weights themselves).
  • “What is Muon / Shampoo?” Matrix-aware optimizers: Shampoo preconditions updates with per-matrix curvature statistics rather than per-element ones, and Muon orthogonalizes the momentum update of each weight matrix (via Newton-Schulz iterations).

§124.13 Gradient accumulation wrapper

Problem statement. Implement a gradient accumulation wrapper. Given a model with .forward(x), .backward(loss), and a .params dict with associated .grads, accumulate gradients over n_steps mini-batches before applying one optimizer step.

Why they’re asking. Gradient accumulation is the standard technique for training with effective batch sizes that exceed GPU memory. The interviewer is checking: (a) the candidate zeroes gradients before accumulation starts, (b) the loss is scaled by 1/n_steps so the average matches a true large-batch step, and (c) the optimizer step fires exactly once per accumulation window.

class GradAccumulator:
    """Accumulate gradients over n_steps before one optimizer step."""

    def __init__(self, model, optimizer, n_steps: int):
        self.model = model
        self.optimizer = optimizer
        self.n = n_steps
        self._step = 0
        self._zero_grads()

    def _zero_grads(self):
        for name in self.model.params:
            self.model.grads[name] = 0.0

    def step(self, x, y):
        """Returns loss (scaled). Applies optimizer every n_steps calls."""
        logits = self.model.forward(x)
        loss = self.model.loss_fn(logits, y)

        # Scale loss so accumulated gradient == gradient of average loss
        scaled_loss = loss / self.n
        self.model.backward(scaled_loss)

        self._step += 1
        if self._step % self.n == 0:
            self.optimizer.step(self.model.params, self.model.grads)
            self._zero_grads()

        return float(loss)

The common bugs.

  • Forgetting to zero gradients at the start of each accumulation window — gradients pile up across windows, producing incorrect updates.
  • Not scaling the loss by 1/n_steps — the accumulated gradient becomes n times the correct value, which destabilizes training.
  • Calling the optimizer step on every mini-batch — gradient accumulation’s entire purpose is to fire the optimizer once per window.

Stretch follow-ups.

  • “How does gradient accumulation interact with batch normalization?” BatchNorm statistics are computed per micro-batch, not per accumulation window — so accumulation does not reproduce true large-batch BatchNorm behavior. Transformers sidestep the problem by using LayerNorm.
  • “What is the equivalent in PyTorch?” loss.backward() accumulates into .grad by default; call optimizer.zero_grad() only at the start of the window, optimizer.step() only at the end.
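
The 1/n_steps scaling rule in (b) can be verified on a toy least-squares model; everything here is an illustrative sketch:

```python
import numpy as np

# For a mean-squared-error loss, summing micro-batch gradients of
# (loss / n_steps) reproduces the full-batch gradient exactly.
rng = np.random.default_rng(0)
w = rng.normal(size=3)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)

def grad(w, Xb, yb):                      # gradient of MSE w.r.t. w
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(w, X, y)                      # one big batch of 8
acc = np.zeros(3)
for Xb, yb in zip(np.split(X, 4), np.split(y, 4)):
    acc += grad(w, Xb, yb) / 4            # 4 micro-batches, each scaled by 1/4
print(np.allclose(acc, full))             # True
```

Dropping the / 4 makes acc four times too large — the second bug in the list above.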

§124.14 Brute-force vector search with cosine similarity

Problem statement. Given a matrix of stored vectors index (shape [n, d]) and a query vector q (shape [d]), return the indices of the top-k most similar vectors by cosine similarity.

Why they’re asking. Brute-force search is the correctness baseline every ANN index is tested against. The interviewer wants: (a) correct L2 normalization (cosine = dot product of unit vectors), (b) efficient np.argsort usage (not sorting the full list just to take k), and (c) awareness of when brute force is the right choice (n < 10k, high-frequency updates).

import numpy as np

def cosine_top_k(
    index: np.ndarray,      # [n, d]
    q: np.ndarray,          # [d]
    k: int,
) -> np.ndarray:            # [k] indices, best first
    # Normalize
    index_norm = index / (np.linalg.norm(index, axis=1, keepdims=True) + 1e-8)
    q_norm     = q / (np.linalg.norm(q) + 1e-8)

    # Cosine similarity = dot product of unit vectors
    scores = index_norm @ q_norm        # [n]

    # argpartition to find top-k in O(n) then sort just those k
    if k >= len(scores):
        top_k_idx = np.arange(len(scores))
    else:
        top_k_idx = np.argpartition(scores, -k)[-k:]

    # Sort the k candidates descending by score
    top_k_idx = top_k_idx[np.argsort(scores[top_k_idx])[::-1]]
    return top_k_idx

The common bugs.

  • Using np.argsort(scores)[-k:] — this sorts all n scores; np.argpartition is O(n) vs O(n log n) and a visible win for large n.
  • Not normalizing before the dot product — unnormalized dot product is inner product, not cosine similarity.
  • Division by zero for zero-norm vectors — adding a small eps to the denominator handles this.

Stretch follow-ups.

  • “What is the complexity, and when does it become unacceptable?” O(n × d) for the matmul. Acceptable up to roughly n = 100k at d = 768 on a modern CPU. Beyond that, use HNSW or FAISS IVF.
  • “How do you handle vectors that were added after the initial build?” Brute-force handles it trivially by appending to the index matrix. HNSW requires a graph insert; IVF requires coarse assignment.
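
The argpartition claim in the bugs list is easy to sanity-check: on tie-free random scores, partition-then-sort reproduces the full argsort's top-k.

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.normal(size=10_000)
k = 5

full = np.argsort(scores)[::-1][:k]              # O(n log n): sorts everything
part = np.argpartition(scores, -k)[-k:]          # O(n): unordered top-k
part = part[np.argsort(scores[part])[::-1]]      # order just those k
print(np.array_equal(full, part))                # True when scores are tie-free
```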

§124.15 HNSW graph search

Problem statement. Given a pre-built HNSW graph (a dict from node ID to dict of layer → list of neighbor IDs), and a distance function, implement the greedy search from the entry point down to layer 0, returning the top-k nearest neighbors.

Why they’re asking. HNSW is the dominant algorithm behind most production vector indexes (Qdrant, Weaviate, pgvector’s HNSW index, FAISS’s IndexHNSW). The interview tests whether the candidate understands the layered structure (greedy descent at upper layers, beam-expanded search at layer 0) — not just that “HNSW is fast.”

import numpy as np
import heapq
from typing import Dict, List, Callable

def hnsw_search(
    graph: Dict[int, Dict[int, List[int]]],  # node -> layer -> [neighbor_ids]
    vectors: np.ndarray,                     # [n, d] all node vectors
    query: np.ndarray,                       # [d]
    entry_point: int,
    num_layers: int,
    ef: int,                                 # search width at layer 0
    k: int,
) -> List[int]:
    """Greedy HNSW search. Returns k nearest neighbor IDs."""
    def dist(a_id: int) -> float:
        return float(np.linalg.norm(vectors[a_id] - query))

    current = entry_point

    # Phase 1: greedy descent through upper layers (L down to 1)
    for layer in range(num_layers, 0, -1):
        improved = True
        while improved:
            improved = False
            neighbors = graph.get(current, {}).get(layer, [])
            for nb in neighbors:
                if dist(nb) < dist(current):
                    current = nb
                    improved = True

    # Phase 2: best-first search at layer 0, keeping the ef best candidates
    candidates = [(dist(current), current)]     # min-heap by distance
    visited    = {current}
    result     = [(-dist(current), current)]    # max-heap (negated) of best ef

    while candidates:
        d_c, c = heapq.heappop(candidates)
        # Prune: stop once the closest unexplored candidate is farther
        # than the worst of the ef best found so far
        if d_c > -result[0][0]:
            break
        for nb in graph.get(c, {}).get(0, []):
            if nb not in visited:
                visited.add(nb)
                d_nb = dist(nb)
                if len(result) < ef or d_nb < -result[0][0]:
                    heapq.heappush(candidates, (d_nb, nb))
                    heapq.heappush(result, (-d_nb, nb))
                    if len(result) > ef:
                        heapq.heappop(result)   # evict the current worst

    best = sorted((-neg_d, node) for neg_d, node in result)
    return [node for _, node in best[:k]]

[Figure: HNSW layered structure. Greedy single-hop descent at layers 2 and 1, then beam-expanded search at layer 0.]
HNSW descends greedily through sparse upper layers to find a good starting point, then expands with ef candidates at the dense bottom layer — separating coarse navigation from fine search.

The common bugs.

  • Running the beam-expanded search at every layer — the whole point of upper layers is a single greedy hop per node.
  • Pruning the layer-0 search too aggressively by setting ef = k — ef should be larger than k (typically 2-10×) to improve recall.
  • Not tracking visited — without it, the graph walk can revisit nodes exponentially.

Stretch follow-ups.

  • “What are M and ef_construction?” M is the number of neighbors per node during graph construction; ef_construction is the search width used during insert. Both trade index build time and memory against recall.
  • “How does Qdrant use HNSW differently from FAISS HNSW?” Qdrant exposes per-collection quantization and per-segment HNSW graphs; FAISS treats the whole index as one flat structure.

§124.16 BM25 scorer

Problem statement. Implement BM25 scoring. Given a collection of documents and a query, score all documents against the query.

Why they’re asking. BM25 is the statistical backbone of hybrid search. Every RAG system that combines dense and sparse retrieval uses BM25 as the sparse component. The interviewer wants to see: (a) the IDF formula (with or without the +1 smoothing — note which one), (b) the TF saturation term (k1 + 1) in the numerator, and (c) the length normalization.

import math
from collections import defaultdict
from typing import List, Dict

def build_bm25_index(docs: List[List[str]]) -> Dict:
    """docs: list of tokenized documents (list of word strings)."""
    N = len(docs)
    df: Dict[str, int] = defaultdict(int)         # doc frequency
    tf: List[Dict[str, int]] = []                  # term frequency per doc
    lengths = []

    for doc in docs:
        lengths.append(len(doc))
        tf_doc: Dict[str, int] = defaultdict(int)
        for term in doc:
            tf_doc[term] += 1
        tf.append(tf_doc)
        for term in set(doc):
            df[term] += 1

    avg_dl = sum(lengths) / N

    return {"N": N, "df": df, "tf": tf, "lengths": lengths, "avg_dl": avg_dl}

def bm25_score(
    idx: Dict,
    query_terms: List[str],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Return BM25 score for every document."""
    N, df, tf_list = idx["N"], idx["df"], idx["tf"]
    avg_dl, lengths = idx["avg_dl"], idx["lengths"]
    scores = []

    for i, tf_doc in enumerate(tf_list):
        score = 0.0
        dl = lengths[i]
        for term in query_terms:
            if term not in df:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            tf_val = tf_doc.get(term, 0)
            tf_norm = tf_val * (k1 + 1) / (tf_val + k1 * (1 - b + b * dl / avg_dl))
            score += idf * tf_norm
        scores.append(score)

    return scores

The common bugs.

  • Using log(N / df) instead of the Robertson-Spärck Jones IDF with the +0.5 smoothing — the smoothed version handles terms that appear in every document without going to zero.
  • Forgetting the (k1 + 1) in the numerator of the TF term — without it, the saturation behavior is wrong.
  • Using raw TF instead of normalized TF — raw TF rewards longer documents; the (1 - b + b * dl / avg_dl) term is the length normalization.

Stretch follow-ups.

  • “How do you combine BM25 and dense cosine scores?” Reciprocal Rank Fusion (RRF) or linear combination with learned weights. RRF is score-scale-agnostic, which makes it robust.
  • “What is BM25F?” An extension that applies different field weights and length normalizations to structured fields (title, body, metadata).
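
The RRF combination mentioned in the first follow-up fits in a few lines. A minimal sketch, with c = 60 as the conventional constant:

```python
# Each ranking contributes 1 / (c + rank) per document, so raw score
# scales (BM25 vs cosine) never need to be calibrated against each other.
def rrf(rankings, c=60):
    scores = {}
    for ranking in rankings:                   # each: doc ids, best first
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks high in both lists and wins the fused ranking
print(rrf([["a", "b", "c"], ["b", "c", "a"]])[0])   # b
```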

§124.17 Token-bucket rate limiter

Problem statement. Implement a token-bucket rate limiter. Support capacity tokens, a refill rate of rate tokens per second, and a consume(n) method that returns True if the request is allowed.

Why they’re asking. Rate limiting is a first-class concern in LLM APIs — tokens per minute, requests per second, per-user budgets. The interview checks: (a) monotonic clock usage (time.monotonic(), not time.time()), (b) refill-on-demand (lazy computation instead of a background thread), and (c) correct clamping to capacity.

import time

class TokenBucket:
    """Token bucket rate limiter. Thread-safe variant would add a lock."""

    def __init__(self, capacity: float, rate: float):
        """capacity: max tokens. rate: tokens refilled per second."""
        self.capacity = capacity
        self.rate     = rate
        self.tokens   = float(capacity)     # start full
        self.last_refill = time.monotonic()

    def _refill(self):
        now     = time.monotonic()
        elapsed = now - self.last_refill
        # Add tokens earned since last call, but no more than capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def consume(self, n: float = 1.0) -> bool:
        """Returns True if n tokens are available (and consumes them)."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

The common bugs.

  • Using a background thread to refill — unnecessary; lazy refill on every consume is correct and simpler.
  • Using time.time() — wall clock can go backwards (NTP corrections). time.monotonic() is the correct choice.
  • Not clamping to capacity during refill — uncapped refill lets tokens accumulate across idle periods, creating burst spikes larger than the intended capacity.

Stretch follow-ups.

  • “How do you make this distributed (across multiple API gateway instances)?” Use Redis with a Lua script that atomically refills and consumes from a shared bucket, or approximate with fixed-window counters via INCR + EXPIRE.
  • “What is the difference between token bucket and leaky bucket?” Token bucket allows burst up to capacity; leaky bucket enforces a strictly constant output rate.
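
The refill logic is easiest to test with an injectable clock instead of time.monotonic(). A sketch of that variant — the class name and wiring are illustrative:

```python
class FakeClockBucket:
    """Same token bucket, but the clock is a passed-in callable."""

    def __init__(self, capacity: float, rate: float, clock):
        self.capacity, self.rate, self.clock = capacity, rate, clock
        self.tokens = float(capacity)
        self.last = clock()

    def consume(self, n: float = 1.0) -> bool:
        now = self.clock()
        # Lazy refill, clamped to capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

t = [0.0]                                       # mutable fake time
bucket = FakeClockBucket(capacity=2, rate=1.0, clock=lambda: t[0])
print([bucket.consume() for _ in range(3)])     # [True, True, False]
t[0] = 1.0                                      # one second passes, one token back
print(bucket.consume())                         # True
```

The same dependency-injection trick is how production rate limiters are unit-tested without sleeping.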

§124.18 LRU cache

Problem statement. Implement an LRU cache with a fixed capacity. Support get(key) (return None if missing) and put(key, value).

Why they’re asking. LRU cache is the classic interview question, but in ML systems interviews it shows up concretely: embedding caches, KV prefix caches, prompt caches. The interviewer wants OrderedDict used fluently — the candidate who reaches for a doubly-linked-list + hashmap is doing more work than necessary.

from collections import OrderedDict
from typing import Any, Optional

class LRUCache:
    """O(1) get and put using OrderedDict's move_to_end."""

    def __init__(self, capacity: int):
        self.cap   = capacity
        self.cache: OrderedDict = OrderedDict()

    def get(self, key: Any) -> Optional[Any]:
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)   # mark as recently used
        return self.cache[key]

    def put(self, key: Any, value: Any):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.cap:
            self.cache.popitem(last=False)   # evict LRU (head of OrderedDict)

[Figure: LRU cache state after four operations. get promotes a key to the MRU end; put evicts from the LRU end when capacity is exceeded.]
After get(D), D moves to the MRU end; A is now the least-recently-used entry and will be evicted on the next put() that exceeds capacity.

The common bugs.

  • Forgetting move_to_end on a get hit — the cache becomes FIFO, not LRU.
  • Using popitem(last=True) for eviction — this evicts the MRU (most recent), the opposite of the intended behavior.
  • Not handling the case where a key being put already exists — without move_to_end before the assignment, the existing key’s order does not update.
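
The two OrderedDict primitives doing all the work can be exercised in isolation:

```python
from collections import OrderedDict

od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
od.move_to_end("a")              # "a" becomes the MRU entry
print(next(iter(od)))            # b  (now the LRU end)
od.popitem(last=False)           # evicts from the LRU end ("b")
print(list(od))                  # ['c', 'a']
```

Swapping in last=True on the final line would evict "a" — the freshly touched key — which is the second bug above.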

Stretch follow-ups.

  • “How do you make this thread-safe?” Wrap get and put with threading.Lock(). For high contention, use a striped lock (sharded by key hash).
  • “How does vLLM’s prefix cache differ?” It’s block-granular — cache entries are blocks of KV cache for a prefix hash, not individual embeddings. Eviction is by LRU but operates on blocks.

§124.19 Speculative decoding verification loop

Problem statement. Implement the verification loop for speculative decoding. Given: draft_tokens (list of K token IDs proposed by the draft model), target_logits (shape [K+1, vocab_size] — the target model’s logits for each draft token position plus one more), return the accepted prefix and, if a rejection occurred, a resampled replacement token.

Why they’re asking. Speculative decoding is the inference speedup technique interviewers most frequently probe in depth. The verification loop is the algorithmic core. The candidate must show: (a) the rejection sampling criterion (accept token t if target_prob[t] / draft_prob[t] >= Uniform(0,1)), (b) the adjusted distribution on rejection, and (c) why all K tokens can be verified in a single target model forward pass.

import numpy as np
from typing import List, Tuple

def verify_speculative(
    draft_tokens: List[int],                 # [K] token IDs from draft model
    draft_probs: np.ndarray,                 # [K, V] — draft prob for each position
    target_logits: np.ndarray,               # [K+1, V] — target model output
) -> Tuple[List[int], int]:
    """
    Returns (accepted_tokens, bonus_token).
    accepted_tokens: the prefix of draft_tokens the target accepts (may be empty).
    bonus_token: one additional token sampled from the adjusted distribution.
    """
    K, V = draft_probs.shape

    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    target_probs = np.array([softmax(target_logits[i]) for i in range(K + 1)])
    # [K+1, V]

    accepted = []
    for i, t in enumerate(draft_tokens):
        p_target = target_probs[i, t]
        p_draft  = draft_probs[i, t]

        # Accept with probability min(1, target/draft)
        r = np.random.uniform()
        if r <= p_target / (p_draft + 1e-9):
            accepted.append(t)
        else:
            # Rejection: resample from the residual distribution
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum() + 1e-9
            bonus = int(np.random.choice(V, p=residual))
            return accepted, bonus

    # All K tokens accepted; also sample the bonus token from position K
    bonus = int(np.random.choice(V, p=target_probs[K]))
    return accepted, bonus

[Figure: Speculative decoding timeline. The draft model generates K=4 tokens sequentially; the target model verifies all of them plus a bonus position in one parallel forward pass.]
The target model processes all K draft tokens in one forward pass at full parallelism; sequential cost is only the K cheap draft-model steps — accepted tokens are free from the target's perspective.

The common bugs.

  • Rejecting based on target_prob < draft_prob (a threshold comparison) instead of the acceptance ratio min(1, target/draft) — the ratio is the mathematically correct criterion that preserves the target distribution.
  • Not computing the residual distribution on rejection — resampling directly from target_probs[i] biases the output, because the acceptance step has already routed probability mass toward tokens the draft model favors. The residual max(p_target - p_draft, 0), renormalized, is exactly what restores the target distribution.
  • Using K target logits instead of K+1 — the bonus token at position K requires a logit beyond the last draft token.

Stretch follow-ups.

  • “What acceptance rate makes speculative decoding worthwhile?” If each draft token is accepted with probability α, the expected tokens emitted per target forward pass is (1 − α^(K+1)) / (1 − α), approximately K·α + 1 when α is high. It is worthwhile when those extra tokens outweigh the K draft-model steps — roughly when α exceeds the cost ratio draft_cost / target_cost.
  • “What is Medusa?” It replaces the draft model with multiple independent prediction heads on the same model — eliminating the separate draft model at the cost of slightly lower acceptance rates.

§124.20 Naive ring all-reduce

Problem statement. Implement a simple ring all-reduce for gradient synchronization across n workers. Each worker has a numpy array of gradients. After the all-reduce, every worker should hold the sum of all workers’ arrays.

Why they’re asking. All-reduce is the communication primitive behind every data-parallel training job. The ring algorithm sends roughly 2d(n−1)/n elements per worker — nearly independent of the number of workers — versus O(n·d) for a naive gather-at-one-node scheme, and needs only a chunk-sized receive buffer of extra memory. The interviewer wants: (a) the two-phase ring structure (reduce-scatter then all-gather), and (b) awareness that this is a simulation — real all-reduce uses NCCL, which pipelines across chunks.

import numpy as np
from typing import List

def ring_all_reduce(workers: List[np.ndarray]) -> List[np.ndarray]:
    """
    Naive ring all-reduce (in-process simulation).
    workers: list of n arrays, each shape [d]. Returns list of n arrays
    each containing the element-wise sum.

    Two phases:
    1. Reduce-scatter: each worker forwards a partially reduced chunk to
       its right neighbor; after n-1 rounds each worker owns one fully
       reduced chunk.
    2. All-gather: the fully reduced chunks circulate around the ring
       until every worker holds all of them.
    """
    n = len(workers)
    d = workers[0].shape[0]
    assert all(w.shape == (d,) for w in workers)
    assert d % n == 0, "assume d divisible by n for simplicity"
    chunk = d // n

    # Working copies so we don't mutate inputs
    bufs = [w.copy() for w in workers]

    def sl(c: int) -> slice:
        return slice(c * chunk, (c + 1) * chunk)

    # --- Phase 1: reduce-scatter (n-1 rounds) ---
    for step in range(n - 1):
        # Snapshot every send first so the round reads pre-round values only
        sends = [(i, (i - step) % n, bufs[i][sl((i - step) % n)].copy())
                 for i in range(n)]
        for i, c, data in sends:
            bufs[(i + 1) % n][sl(c)] += data    # accumulate at right neighbor

    # After phase 1, worker i owns the fully reduced chunk (i + 1) % n

    # --- Phase 2: all-gather (n-1 rounds) ---
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, bufs[i][sl((i + 1 - step) % n)].copy())
                 for i in range(n)]
        for i, c, data in sends:
            bufs[(i + 1) % n][sl(c)] = data     # overwrite at right neighbor

    return bufs


# Verify: every worker should hold the element-wise sum
np.random.seed(0)
workers = [np.random.randn(8) for _ in range(4)]
expected_sum = sum(workers)
results = ring_all_reduce(workers)
for r in results:
    assert np.allclose(r, expected_sum, atol=1e-6), "all-reduce incorrect"

The common bugs.

  • Running only one phase — reduce-scatter alone gives each worker one correct shard, not the full sum; all-gather is required.
  • Wrong ring direction: both phases must use the same direction (left or right); mixing them produces garbage.
  • Reading a neighbor’s buffer after it has already been updated in the same round — an in-process simulation must snapshot each round’s sends before applying them (and work on copies so the inputs are never mutated).

Stretch follow-ups.

  • “Why is ring all-reduce preferred over a parameter server?” A parameter server creates O(n) traffic on the server; ring all-reduce distributes traffic evenly across all workers, saturating the interconnect equally.
  • “What does NCCL add on top of this?” Hardware-aware chunk pipelining (overlaps compute and communication), NVLink and InfiniBand topology awareness, and kernels that operate directly in GPU HBM without CPU involvement.
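
The bandwidth argument in the first follow-up is worth quantifying. A one-line helper, with illustrative sizes:

```python
# Per-worker data sent by ring all-reduce: each phase moves (n-1)/n of
# the buffer, so the total is ~2 * (n-1)/n * size, nearly flat in n.
def ring_bytes_per_worker(n_workers: int, buffer_bytes: float) -> float:
    return 2 * (n_workers - 1) / n_workers * buffer_bytes

fp16_70b = 70e9 * 2                                 # 70B params in fp16
print(ring_bytes_per_worker(8, fp16_70b) / 1e9)     # ~245 GB per worker
print(ring_bytes_per_worker(1024, fp16_70b) / 1e9)  # ~280 GB: barely grows
```

Going from 8 to 1024 workers raises per-worker traffic by under 15% — the parameter-server alternative would grow it 128×.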

Read it yourself

  • Andrej Karpathy, nanoGPT (GitHub). The cleanest single-file reference for attention, multi-head, cross-entropy, and the autoregressive loop — reads in under an hour.
  • Sebastian Raschka, Build a Large Language Model (From Scratch). Chapter-by-chapter Python implementations of every algorithm in this chapter.
  • The BPE paper: Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units” (2016) — the original merge-table algorithm.
  • Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models” (2021) — the B=0 initialization is in Section 4.1.
  • Leviathan et al., “Fast Inference from Transformers via Speculative Decoding” (2023) — the acceptance criterion derivation is in Section 2.3.
  • Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding” (2021) — the pair-wise rotation derivation from first principles.
  • The HNSW paper: Malkov & Yashunin, “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs” (2018) — search algorithm in Algorithm 5.

Practice

  1. Implement all twenty algorithms from memory, one per day, without reference. Time yourself; target under 15 minutes per algorithm.
  2. For each algorithm, write one unit test that catches the most common bug listed in its section. Run it before you consider the implementation done.
  3. Pick five algorithms and explain, out loud, the numerical or algorithmic trap that makes the naive implementation wrong. Practice until the explanation takes under 60 seconds.
  4. Chain §124.7 (KV cache) and §124.2 (attention) into a working decode loop: given a sequence of embeddings, produce one new token per step using the cached KV.
  5. Combine §124.14 (brute-force cosine search) and §124.16 (BM25) into a simple hybrid scorer using reciprocal rank fusion. Test that the fused scores differ from either alone.
  6. Stretch: Implement §124.20 (ring all-reduce) with n=8 workers and verify the result against a simple sum(workers). Then replace the in-process simulation with socket-based communication between threads.
  7. Stretch: Take §124.19 (speculative decoding) and wire it to a real HuggingFace model as the target and a smaller model as the draft. Measure the wall-clock speedup as a function of K and the acceptance rate.