The coding interview: twenty ML systems algorithms
"You can fake a system design. You cannot fake a working softmax"
Every Part X chapter so far has focused on the design-round format — clarify, estimate, design, drill, ops. But many loops at senior levels include a 45-minute coding round that sits next to the system design, and it tests a tighter scope: can you implement, from scratch and without documentation, the algorithms that make the system work? The ten minutes you spend sketching a KV cache in the design round will be followed, at some companies, by a coding round that says “now implement it.”
The twenty algorithms in this chapter are the ones that show up most often. They are not data-structure puzzles. They are ML primitives — the numerical routines, the serving logic, the retrieval plumbing — that a senior ML engineer should be able to write correctly under pressure. Getting them right demonstrates something that a whiteboard diagram never can: you understand the operation at the level of shapes, dtypes, numerical edge cases, and off-by-one errors.
The interviewers grading this round are not looking for elegance. They are looking for three things: correct behavior on the normal case, awareness of the numerical traps, and knowledge of what to check when the output looks wrong. A candidate who writes a slightly verbose but numerically stable softmax and immediately explains the max-subtract trick is a candidate who has shipped code. A candidate who writes the naive version and says “I think this is fine” is a candidate who has read slides.
§124.1 Numerically stable softmax
Problem statement. Implement softmax over a 1-D array of logits. Then implement log-softmax (for use in cross-entropy), and explain why numerical stability matters.
Why they’re asking. Softmax is the first filter. Every candidate writes it; the question is whether they write the version that explodes on large logits. The interviewer wants to see (a) the max-subtract trick stated before code is written, not discovered after the output is nan, and (b) the log-sum-exp formulation for computing log-probabilities without materializing the full softmax.
import numpy as np
def softmax(x: np.ndarray) -> np.ndarray:
"""Numerically stable softmax over last axis."""
# Subtract max to prevent exp overflow (no change to output — divides out)
x = x - x.max(axis=-1, keepdims=True)
e = np.exp(x)
return e / e.sum(axis=-1, keepdims=True)
def log_softmax(x: np.ndarray) -> np.ndarray:
"""log(softmax(x)) without materializing softmax — avoids log(0)."""
x_max = x.max(axis=-1, keepdims=True)
log_sum_exp = np.log(np.exp(x - x_max).sum(axis=-1, keepdims=True)) + x_max
return x - log_sum_exp
# Quick sanity checks
logits = np.array([1.0, 2.0, 3.0])
probs = softmax(logits)
assert abs(probs.sum() - 1.0) < 1e-6
assert np.allclose(np.log(probs), log_softmax(logits), atol=1e-6)
# Large logit case — naive exp would overflow to inf
large = np.array([1000.0, 1001.0, 1002.0])
assert not np.any(np.isnan(softmax(large)))
The common bugs.
- Subtracting mean instead of max. Mean-subtract is not wrong per se but does not prevent overflow; max-subtract does.
- Writing np.exp(x).sum() before x - x.max(). The subtraction must happen first.
- Using log(softmax(x)) for cross-entropy loss instead of log_softmax. Squashing probabilities close to zero through log after softmax loses precision; log_softmax keeps it.
Stretch follow-ups.
- “Now make this work on a batch of logits (shape [B, V]) without a loop.” The keepdims=True already handles it; note the candidate must confirm axes.
- “What happens to softmax when two logits are equal and very large?” Numerically fine after max-subtract; still worth saying out loud.
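The batch follow-up is easy to demonstrate with the softmax above (restated here so the snippet is self-contained) — a quick sketch showing that the axis=-1 reductions handle a [B, V] input with no loop, and that max-subtract makes the result shift-invariant:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

batch = np.array([
    [1.0, 2.0, 3.0],
    [1001.0, 1002.0, 1003.0],   # same logits shifted by 1000 — naive exp overflows
])
probs = softmax(batch)
assert probs.shape == (2, 3)
assert np.allclose(probs.sum(axis=-1), 1.0)
# Softmax is shift-invariant, so both rows give the same distribution.
assert np.allclose(probs[0], probs[1])
```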
§124.2 Scaled dot-product attention
Problem statement. Given query Q, key K, and value V matrices (shapes [seq, d_k], [seq, d_k], [seq, d_v]), implement scaled dot-product attention. Add an optional boolean causal mask.
Why they’re asking. This is the kernel of every transformer. The interviewer will let the candidate write it and then ask about shapes — wrong shapes are the most common failure. They will also probe the sqrt(d_k) denominator: can the candidate explain why it’s there (variance preservation under random initialization), or do they just copy it?
import numpy as np
def scaled_dot_product_attention(
Q: np.ndarray, # [seq_q, d_k]
K: np.ndarray, # [seq_k, d_k]
V: np.ndarray, # [seq_k, d_v]
causal: bool = False,
) -> np.ndarray: # [seq_q, d_v]
d_k = Q.shape[-1]
# Scores: [seq_q, seq_k]
scores = Q @ K.T / np.sqrt(d_k)
if causal:
seq_q, seq_k = scores.shape
# Upper-triangular mask (positions i should not attend to j > i)
mask = np.triu(np.ones((seq_q, seq_k), dtype=bool), k=1)
scores = np.where(mask, -1e9, scores)
weights = softmax(scores) # [seq_q, seq_k]
return weights @ V # [seq_q, d_v]
def softmax(x):
x = x - x.max(axis=-1, keepdims=True)
e = np.exp(x)
return e / e.sum(axis=-1, keepdims=True)
The common bugs.
- Transposing Q instead of K: K @ Q.T is wrong for non-square queries.
- Using k=0 in np.triu for the causal mask, which masks the diagonal and prevents each position from attending to itself.
- Forgetting to apply softmax before the final matmul — a surprisingly common error under pressure.
Stretch follow-ups.
- “What if seq_q != seq_k? When does that happen?” Cross-attention in encoder-decoder models; also prefill vs decode in KV cache.
- “How does the mask change for padding tokens vs causal?” Padding masks zero out specific key positions; the candidate should show np.where(pad_mask, -1e9, scores).
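The padding-mask follow-up can be made concrete with a small sketch. The helper name and the boolean-mask convention (True = padded key) are my own; the masking logic mirrors the causal branch above:

```python
import numpy as np

def masked_attention(Q, K, V, pad_mask=None, causal=False):
    """Hypothetical helper. pad_mask: [seq_k] bool, True where the key is padding."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # [seq_q, seq_k]
    if causal:
        m = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(m, -1e9, scores)
    if pad_mask is not None:
        # Broadcast over the query axis: padded keys get -1e9 for every query
        scores = np.where(pad_mask[None, :], -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return (e / e.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
pad = np.array([False, False, False, False, True])        # last key is padding
out = masked_attention(Q, K, V, pad_mask=pad)
out_ref = masked_attention(Q, K[:4], V[:4])               # same as dropping the key
```

Masking a key with -1e9 should be numerically indistinguishable from removing it — a useful self-check during the interview.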
§124.3 Multi-head attention
Problem statement. Implement multi-head attention. You have weight matrices W_Q, W_K, W_V (each [d_model, d_model]) and W_O ([d_model, d_model]). Accept x of shape [seq, d_model].
Why they’re asking. Multi-head attention is where shape bugs flourish. The reshape-and-transpose dance ([seq, n_heads, d_k] → [n_heads, seq, d_k]) trips up everyone who has only seen it in high-level APIs. The interviewer wants to see the reshape done cleanly and the output concatenation done correctly.
import numpy as np
def multi_head_attention(
x: np.ndarray, # [seq, d_model]
W_Q: np.ndarray, # [d_model, d_model]
W_K: np.ndarray, # [d_model, d_model]
W_V: np.ndarray, # [d_model, d_model]
W_O: np.ndarray, # [d_model, d_model]
n_heads: int,
causal: bool = False,
) -> np.ndarray: # [seq, d_model]
seq, d_model = x.shape
d_k = d_model // n_heads
assert d_model % n_heads == 0
# Project
Q = x @ W_Q # [seq, d_model]
K = x @ W_K
V = x @ W_V
# Split into heads: [seq, d_model] -> [n_heads, seq, d_k]
def split_heads(t):
t = t.reshape(seq, n_heads, d_k) # [seq, n_heads, d_k]
return t.transpose(1, 0, 2) # [n_heads, seq, d_k]
Q_h, K_h, V_h = split_heads(Q), split_heads(K), split_heads(V)
# Attend per head
heads = []
for h in range(n_heads):
heads.append(scaled_dot_product_attention(Q_h[h], K_h[h], V_h[h], causal))
# Concat heads: [n_heads, seq, d_k] -> [seq, d_model]
out = np.concatenate(heads, axis=-1) # [seq, d_model]
return out @ W_O
def scaled_dot_product_attention(Q, K, V, causal=False):
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)
if causal:
mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
scores = np.where(mask, -1e9, scores)
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
w = e / e.sum(axis=-1, keepdims=True)
return w @ V
The common bugs.
- Transposing to [seq, n_heads, d_k] and iterating over the wrong axis — always transpose to [n_heads, seq, d_k] before iterating.
- Concatenating on axis 0 (heads axis) instead of axis -1 (feature axis).
- Forgetting W_O — the output projection is not optional; without it, heads cannot interact.
Stretch follow-ups.
- “Vectorize away the head loop with a batched matmul.” The candidate should sketch np.einsum or batched @ over the [n_heads, seq, d_k] tensor.
- “What is GQA?” Multiple query heads share one key-value head — K_h and V_h are replicated across a group of query heads.
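The vectorization follow-up can be sketched with two einsum contractions (the function name is my own; the logic matches the per-head loop above, just without the loop):

```python
import numpy as np

def attention_heads_vectorized(Q_h, K_h, V_h, causal=False):
    """All heads at once. Q_h, K_h, V_h: [n_heads, seq, d_k] -> [n_heads, seq, d_v]."""
    d_k = Q_h.shape[-1]
    # [n_heads, seq_q, seq_k] in one contraction — no Python loop over heads
    scores = np.einsum("hqd,hkd->hqk", Q_h, K_h) / np.sqrt(d_k)
    if causal:
        seq_q, seq_k = scores.shape[-2:]
        mask = np.triu(np.ones((seq_q, seq_k), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    w = e / e.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,hkd->hqd", w, V_h)

rng = np.random.default_rng(1)
Q_h = rng.normal(size=(2, 5, 4))
K_h = rng.normal(size=(2, 5, 4))
V_h = rng.normal(size=(2, 5, 4))
out = attention_heads_vectorized(Q_h, K_h, V_h, causal=True)
```

Since the mask broadcasts over the leading heads axis, it is built once and shared by all heads.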
§124.4 BPE tokenizer training loop
Problem statement. Implement the training loop for Byte-Pair Encoding (BPE). Input: a list of pre-tokenized strings (already split on whitespace). Target: a merge table of num_merges merges.
Why they’re asking. BPE training is a classic pair-frequency problem. It exposes whether the candidate can maintain a correct frequency count under iterative merging without restarting from scratch each round. The weak answer rebuilds the frequency table from scratch every merge; the strong answer updates it incrementally.
from collections import defaultdict
from typing import List, Tuple, Dict
def train_bpe(
corpus: List[str], # ["hello world", "world peace", ...]
num_merges: int,
) -> List[Tuple[str, str]]:
"""Return the list of (a, b) merge rules in order of application."""
# Represent each word as a tuple of characters + end-of-word marker
vocab: Dict[Tuple[str, ...], int] = defaultdict(int)
for sentence in corpus:
for word in sentence.split():
vocab[tuple(word) + ("</w>",)] += 1
merges: List[Tuple[str, str]] = []
for _ in range(num_merges):
        # Count all adjacent pairs (rebuilt from vocab each round for
        # clarity; the incremental-update version is the optimization
        # interviewers probe for)
pair_freq: Dict[Tuple[str, str], int] = defaultdict(int)
for word, freq in vocab.items():
for a, b in zip(word, word[1:]):
pair_freq[(a, b)] += freq
if not pair_freq:
break
# Pick the most frequent pair
best = max(pair_freq, key=pair_freq.get)
merges.append(best)
# Merge every occurrence of best in vocab
new_vocab: Dict[Tuple[str, ...], int] = {}
a, b = best
for word, freq in vocab.items():
new_word = []
i = 0
while i < len(word):
if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
new_word.append(a + b)
i += 2
else:
new_word.append(word[i])
i += 1
new_vocab[tuple(new_word)] = freq
vocab = new_vocab
return merges
The common bugs.
- Iterating over characters instead of symbols — after merges, symbols are multi-character; iterating over characters restarts from scratch.
- Off-by-one in the merge loop: missing the last pair when zip stops one short — zip(word, word[1:]) is correct; range(len(word) - 1) is the safe alternative.
- Rebuilding pair_freq by scanning the original string corpus instead of the current vocab representation — this ignores all previous merges.
Stretch follow-ups.
- “How does inference-time encoding (encoding a new string with a trained BPE model) work?” Apply merges in order; the longest-match greedy approach is the standard.
- “Why does tiktoken use a regex pre-tokenizer before BPE?” To prevent merges across word boundaries, digits, and punctuation — producing more predictable token boundaries.
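The first follow-up — replaying learned merges at encode time — fits in a short sketch. This is a minimal version that makes one full pass per merge rule; production tokenizers instead merge by priority rank, but the result for a trained merge table is the same:

```python
from typing import List, Tuple

def bpe_encode(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Encode one word by replaying the merge rules in training order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hand-written merge table for illustration
merges = [("l", "l"), ("ll", "o")]
assert bpe_encode("hello", merges) == ["h", "e", "llo", "</w>"]
```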
§124.5 Top-k / top-p (nucleus) sampling
Problem statement. Given a 1-D array of logits, implement a function that returns a sampled token index using top-k filtering, top-p (nucleus) filtering, and temperature — all in one function, applied in that order.
Why they’re asking. Sampling is the last step before text appears, and every combination of flags produces a different distribution. The interviewer wants to see: (a) temperature applied before masking, (b) top-k mask applied before top-p, (c) correct handling of the edge case where top-p cuts out the only remaining token.
import numpy as np
def sample_token(
logits: np.ndarray, # [vocab_size]
temperature: float = 1.0,
top_k: int = 0, # 0 = disabled
top_p: float = 1.0, # 1.0 = disabled
) -> int:
# 1. Temperature scaling
logits = logits / max(temperature, 1e-8)
# 2. Top-k: zero out all but the k largest logits
if top_k > 0:
k = min(top_k, len(logits))
threshold = np.sort(logits)[-k]
logits = np.where(logits < threshold, -1e9, logits)
# 3. Convert to probabilities
probs = softmax(logits)
# 4. Top-p (nucleus): find the smallest set of tokens
# whose cumulative probability >= top_p
if top_p < 1.0:
sorted_idx = np.argsort(probs)[::-1] # highest first
sorted_probs = probs[sorted_idx]
cumulative = np.cumsum(sorted_probs)
        # Keep tokens 0..cutoff — the smallest prefix whose cumulative
        # mass reaches top_p. searchsorted never returns less than 0,
        # so at least one token always survives.
        cutoff = np.searchsorted(cumulative, top_p, side="right")
        removed = sorted_idx[cutoff + 1:]
probs[removed] = 0.0
probs /= probs.sum() # renormalize
# 5. Sample
return int(np.random.choice(len(probs), p=probs))
def softmax(x):
x = x - x.max()
e = np.exp(x)
return e / e.sum()
The common bugs.
- Applying top-p before converting to probabilities — top-p is defined in probability space, not logit space.
- Off-by-one in the nucleus cutoff: cumulative > top_p removes the token that pushes cumulative over top_p, but that token should be included.
- Division by zero if all probs are zeroed out by an aggressive top-k + top-p combination — always keep at least one token.
Stretch follow-ups.
- “What is min-p sampling?” Remove any token with probability less than min_p * max_prob. Simpler than top-p and avoids the cumsum.
- “Why is temperature 0 not truly deterministic in practice?” Hardware non-determinism in floating-point reduction order; argmax (greedy) is the true deterministic baseline.
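The min-p follow-up fits in a few lines — a sketch, applied in probability space like top-p:

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float) -> np.ndarray:
    """Zero out tokens with prob < min_p * max_prob, then renormalize."""
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

probs = np.array([0.6, 0.3, 0.1])
filtered = min_p_filter(probs, min_p=0.5)   # threshold = 0.5 * 0.6 = 0.3
assert np.allclose(filtered, [2/3, 1/3, 0.0])
```

Because the threshold is relative to the peak probability, a confident distribution prunes aggressively while a flat one keeps most tokens — no cumulative sum, no sort.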
§124.6 Beam search
Problem statement. Implement beam search. Given a function score_fn(token_ids) -> np.ndarray[vocab_size] that returns log-probabilities for the next token, produce the top-beam_width complete sequences up to max_len tokens.
Why they’re asking. Beam search is the canonical dynamic-programming decoding algorithm. The interviewer wants to see correct log-prob accumulation (addition, not multiplication), proper termination on EOS, and no hidden bugs in beam expansion order.
import numpy as np
from typing import Callable, List, Tuple
EOS = 1 # assume token ID 1 is end-of-sequence
def beam_search(
score_fn: Callable, # score_fn(list[int]) -> np.ndarray[vocab_size] (log-probs)
beam_width: int,
max_len: int,
vocab_size: int,
bos_id: int = 0,
) -> List[Tuple[float, List[int]]]:
"""Return list of (score, token_ids) sorted best first."""
# Each beam: (cumulative_log_prob, token_ids)
beams: List[Tuple[float, List[int]]] = [(0.0, [bos_id])]
completed: List[Tuple[float, List[int]]] = []
for _ in range(max_len):
candidates: List[Tuple[float, List[int]]] = []
for log_prob, tokens in beams:
if tokens[-1] == EOS:
completed.append((log_prob, tokens))
continue
# Get log-probs for next token
next_log_probs = score_fn(tokens) # [vocab_size]
# Expand: pick top beam_width tokens to keep search tractable
top_k = min(beam_width, vocab_size)
top_ids = np.argsort(next_log_probs)[-top_k:]
for t in top_ids:
candidates.append((
log_prob + next_log_probs[t],
tokens + [int(t)],
))
if not candidates:
break
# Keep only the best beam_width live beams
candidates.sort(key=lambda x: x[0], reverse=True)
beams = candidates[:beam_width]
# Collect any beams that did not hit EOS
completed.extend(beams)
completed.sort(key=lambda x: x[0], reverse=True)
return completed[:beam_width]
The common bugs.
- Accumulating probabilities multiplicatively instead of log-probs additively — multiplicative scores underflow to zero within a few tokens.
- Treating a beam that reached EOS as still live — it should move to the completed list, not be expanded.
- Expanding all beam_width beams to vocab_size candidates without truncating to beam_width before keeping — this creates O(beam_width × vocab_size) candidates; the sort + truncate step is mandatory.
Stretch follow-ups.
- “Why did beam search fall out of favor for LLMs?” It degrades with large models because the highest-probability sequences tend to be degenerate repetitions; sampling with temperature works better in practice.
- “What is length normalization?” Divide cumulative log-prob by sequence length raised to a penalty exponent to prevent short sequences from dominating.
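The length-normalization follow-up is commonly answered with the GNMT-style penalty; a sketch (the function name is my own):

```python
def length_normalized(log_prob: float, length: int, alpha: float = 0.6) -> float:
    """GNMT-style length penalty: divide by ((5 + length) / 6) ** alpha."""
    return log_prob / (((5 + length) / 6) ** alpha)

# alpha = 0 disables normalization entirely; alpha > 0 shrinks the
# effective penalty on longer sequences so they can compete with short,
# high-confidence ones.
short = length_normalized(-2.0, length=3)
longer = length_normalized(-4.0, length=20)
```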
§124.7 KV cache
Problem statement. Implement an append-only KV cache for single-token decoding. The cache holds previously computed keys and values. On each decode step, append the new key and value, then run attention over cache + current query.
Why they’re asking. The KV cache is the primary memory structure in autoregressive inference. The interviewer is checking: (a) the candidate knows it is append-only (no recomputation), (b) the candidate correctly handles the query being length-1 while keys/values grow, and (c) the candidate knows the cache’s memory cost in bytes.
import numpy as np
from typing import Optional
class KVCache:
"""Single-layer, single-head KV cache for autoregressive decoding."""
def __init__(self, d_k: int, d_v: int, max_seq: int):
# Pre-allocate; avoids repeated reallocation
self.K = np.zeros((max_seq, d_k), dtype=np.float32)
self.V = np.zeros((max_seq, d_v), dtype=np.float32)
self.length = 0
def append(self, k: np.ndarray, v: np.ndarray):
"""k, v shapes: [d_k], [d_v] — single token."""
assert self.length < self.K.shape[0], "cache full"
self.K[self.length] = k
self.V[self.length] = v
self.length += 1
def attend(self, q: np.ndarray) -> np.ndarray:
"""q shape: [d_k]. Returns context vector [d_v]."""
K = self.K[:self.length] # [seq_so_far, d_k]
V = self.V[:self.length] # [seq_so_far, d_v]
scores = K @ q / np.sqrt(q.shape[-1]) # [seq_so_far]
scores = scores - scores.max()
w = np.exp(scores)
w /= w.sum() # softmax weights
return w @ V # [d_v]
def decode_step(
x_embed: np.ndarray, # [d_model] — embedding of the new token
W_Q, W_K, W_V, # [d_model, d_model] each
cache: KVCache,
) -> np.ndarray: # [d_v] — context for this step
q = x_embed @ W_Q # [d_k]
k = x_embed @ W_K
v = x_embed @ W_V
cache.append(k, v)
return cache.attend(q)
The common bugs.
- Running attention over only the current token’s K/V instead of the full cache — this negates the point of the cache.
- Growing the cache as a Python list and converting to numpy each step — pre-allocation with a length counter is O(1) per append.
- Forgetting that the causal mask is trivially satisfied at decode time: every previously cached position is already in the past.
Stretch follow-ups.
- “What is PagedAttention?” It replaces the contiguous pre-allocated buffer with block-paged storage so KV memory can be shared across requests and reclaimed on completion.
- “How does the cache scale with batch size and sequence length?” Bytes = batch × seq × n_layers × 2 × d_model × sizeof(dtype). For a 70B model: 80 layers × 2 × 8192 × 2 bytes (bf16) ≈ 2.6 MB per token per sequence — quickly gigabytes at long context and nontrivial batch.
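The scaling formula is easy to wrap in a calculator — a sketch, assuming a 70B-class config (80 layers, d_model 8192, bf16) and full MHA with no GQA sharing:

```python
def kv_cache_bytes(batch: int, seq: int, n_layers: int, d_model: int,
                   dtype_bytes: int = 2) -> int:
    """2x for K and V; assumes n_kv_heads * d_head == d_model (no GQA)."""
    return batch * seq * n_layers * 2 * d_model * dtype_bytes

per_token = kv_cache_bytes(batch=1, seq=1, n_layers=80, d_model=8192)
assert per_token == 2_621_440          # ~2.6 MB per token per sequence

full = kv_cache_bytes(batch=8, seq=8192, n_layers=80, d_model=8192)
# ~172 GB for the whole batch — the memory pressure that motivates
# PagedAttention and GQA
```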
§124.8 LoRA forward pass
Problem statement. Implement the LoRA forward pass. A frozen weight matrix W ([out_dim, in_dim]) is adapted by two low-rank matrices A ([r, in_dim]) and B ([out_dim, r]). Scaling factor is alpha / r. Input x is [in_dim].
Why they’re asking. LoRA is the dominant fine-tuning method in production. The interviewer wants to verify: (a) the candidate can write the additive residual correctly, (b) scaling by alpha/r is applied to the low-rank term (not W), and (c) the candidate knows why B is initialized to zeros at the start of training.
import numpy as np
class LoRALinear:
"""LoRA-adapted linear layer. W is frozen; A and B are trained."""
def __init__(
self,
W: np.ndarray, # [out_dim, in_dim], frozen
r: int, # rank
alpha: float, # scaling hyperparameter
):
out_dim, in_dim = W.shape
self.W = W
self.r = r
self.scale = alpha / r
# Standard LoRA init: A ~ N(0, 1/r), B = 0
self.A = np.random.randn(r, in_dim) / np.sqrt(r) # [r, in_dim]
self.B = np.zeros((out_dim, r)) # [out_dim, r]
def forward(self, x: np.ndarray) -> np.ndarray:
"""x: [in_dim] -> [out_dim]."""
base = self.W @ x # frozen path
lora = self.B @ (self.A @ x) * self.scale # low-rank path
return base + lora
# Sanity: at init, B=0 means LoRA contributes nothing — the adapted model
# starts identical to the base model.
W = np.random.randn(64, 32)
layer = LoRALinear(W, r=4, alpha=16.0)
x = np.random.randn(32)
base_out = W @ x
lora_out = layer.forward(x)
assert np.allclose(base_out, lora_out), "at init, LoRA must be a no-op"
The common bugs.
- Scaling the wrong term: (W + scale * B @ A) @ x is correct algebra but mixes frozen and trained parameters; keep B @ (A @ x) as a separate term so W stays frozen.
- Initializing B randomly — B must be zero at init so the LoRA delta starts at zero. A non-zero B would immediately shift the model away from the base checkpoint.
- Applying scale outside the forward pass (e.g., at merge time only) without accounting for it during training — loss gradients will be miscalibrated.
Stretch follow-ups.
- “How do you merge LoRA back into W for deployment?” W_merged = W + scale * B @ A. Zero deployment overhead after merge.
- “What is DoRA?” Decomposes W into magnitude and direction; updates direction with LoRA and magnitude with a scalar — better learning dynamics on some tasks.
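The merge follow-up as code — a sketch verifying that the merged single matmul reproduces the two-path forward exactly (up to floating-point error):

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray, alpha: float) -> np.ndarray:
    """Fold the LoRA delta into the frozen weight for deployment."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
A = rng.normal(size=(4, 32))
B = rng.normal(size=(64, 4))
x = rng.normal(size=32)

scale = 16.0 / 4
merged_out = merge_lora(W, A, B, alpha=16.0) @ x
two_path_out = W @ x + scale * (B @ (A @ x))
assert np.allclose(merged_out, two_path_out)
```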
§124.9 Cross-entropy loss with label smoothing
Problem statement. Implement cross-entropy loss from scratch. Add an optional label-smoothing parameter epsilon.
Why they’re asking. Cross-entropy is the loss function for language modeling. The interview checks: (a) the candidate uses log_softmax, not log(softmax), (b) label smoothing is applied correctly (distribute epsilon across all classes uniformly), and (c) the candidate can explain why label smoothing helps calibration.
import numpy as np
def cross_entropy(
logits: np.ndarray, # [vocab_size] or [batch, vocab_size]
targets: np.ndarray, # int scalar or [batch] of class indices
epsilon: float = 0.0, # label smoothing in [0, 1)
) -> float:
"""Cross-entropy loss with optional label smoothing."""
single = logits.ndim == 1
if single:
logits = logits[None] # [1, V]
targets = np.array([targets])
batch, V = logits.shape
log_probs = log_softmax(logits) # [batch, V]
if epsilon == 0.0:
# Hard labels: pick the log-prob of the target class
loss = -log_probs[np.arange(batch), targets]
else:
# Smooth labels: mix target with uniform distribution
# true target gets (1 - epsilon), every class gets epsilon / V
smooth = np.full_like(log_probs, epsilon / V)
smooth[np.arange(batch), targets] += (1.0 - epsilon)
loss = -(smooth * log_probs).sum(axis=-1)
return float(loss.mean()) if not single else float(loss[0])
def log_softmax(x):
x_max = x.max(axis=-1, keepdims=True)
lse = np.log(np.exp(x - x_max).sum(axis=-1, keepdims=True)) + x_max
return x - lse
The common bugs.
- Using np.log(softmax(logits)) instead of log_softmax — precision loss when probabilities are very small.
- A smoothed target that sums to more than 1. The correct assignment gives the target class (1 - epsilon) + epsilon/V and every other class epsilon/V, which sums to exactly 1.
- Averaging over the vocabulary dimension before summing labels — the dot product (smooth * log_probs).sum(axis=-1) is correct; .mean(axis=-1) is not.
Stretch follow-ups.
- “Why does label smoothing improve calibration?” The model can no longer drive the target logit to +∞ without incurring loss from the smoothed non-target classes; confidence is bounded.
- “How does temperature scaling (post-hoc calibration) differ from label smoothing?” Temperature scales logits at inference time to adjust confidence; label smoothing regularizes during training.
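The precision argument behind the first bug is easy to demonstrate — a sketch with a deliberately extreme logit (softmax/log_softmax restated so the snippet is self-contained):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def log_softmax(x):
    x_max = x.max()
    return x - (np.log(np.exp(x - x_max).sum()) + x_max)

logits = np.array([0.0, 1000.0])
with np.errstate(divide="ignore"):
    naive = np.log(softmax(logits))   # exp(-1000) underflows -> log(0) = -inf
stable = log_softmax(logits)          # exact: [-1000., 0.]
```

The naive path destroys the gradient signal for the low-probability class; the log-sum-exp path keeps the exact log-probability.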
§124.10 RoPE rotation
Problem statement. Implement Rotary Position Embedding (RoPE). Precompute the cosine/sine rotation matrices for a given maximum sequence length and head dimension, then apply them to a query or key tensor.
Why they’re asking. RoPE is the position encoding in every modern LLM (Llama, Mistral, Qwen). The interviewer wants to see: (a) correct precomputation of θ frequencies, (b) the rotation applied pair-wise over the head dimension, and (c) the candidate noting that RoPE only touches Q and K, never V.
import numpy as np
def precompute_rope(d: int, max_seq: int, base: float = 10000.0):
"""Precompute cos and sin matrices for RoPE.
Returns cos, sin each of shape [max_seq, d//2].
"""
# Frequencies: theta_i = 1 / base^(2i / d) for i in [0, d/2)
i = np.arange(d // 2, dtype=np.float32)
theta = 1.0 / (base ** (2 * i / d)) # [d//2]
positions = np.arange(max_seq, dtype=np.float32) # [max_seq]
angles = np.outer(positions, theta) # [max_seq, d//2]
return np.cos(angles), np.sin(angles)
def apply_rope(
x: np.ndarray, # [seq, d] (Q or K)
cos: np.ndarray, # [max_seq, d//2]
sin: np.ndarray, # [max_seq, d//2]
offset: int = 0, # position offset for decode-time cache
) -> np.ndarray:
seq, d = x.shape
assert d % 2 == 0
# Split into pairs: x1 = even dims, x2 = odd dims
x1, x2 = x[:, :d//2], x[:, d//2:]
c = cos[offset : offset + seq] # [seq, d//2]
s = sin[offset : offset + seq]
# Rotate: [x1, x2] -> [x1*cos - x2*sin, x1*sin + x2*cos]
return np.concatenate([x1 * c - x2 * s, x1 * s + x2 * c], axis=-1)
# RoPE is only applied to Q and K, never to V.
The common bugs.
- Applying RoPE to V — rotations encode position into Q and K; V carries content and should not be rotated.
- Splitting the head dimension as [0::2] and [1::2] (interleaved) instead of [:d//2] and [d//2:] (contiguous halves). The math is the same; the layout differs. Match the convention used in the checkpoint.
- Forgetting the offset parameter in decode-time inference — position IDs must reflect the token’s actual position in the sequence, not its position within the current forward pass.
Stretch follow-ups.
- “What is YaRN / LongRoPE?” Extended-context techniques that interpolate or re-scale the θ frequencies to allow longer sequences than the training context without full re-training.
- “What breaks if you rotate in bfloat16?” Small angles lose precision for positions near zero; warm up in fp32 and cast down, or keep the cos/sin table in fp32.
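The property worth stating out loud — after RoPE, attention scores depend only on the relative distance between query and key — can be checked numerically. A sketch, with the two functions restated so it is self-contained:

```python
import numpy as np

def precompute_rope(d, max_seq, base=10000.0):
    i = np.arange(d // 2, dtype=np.float64)
    theta = 1.0 / (base ** (2 * i / d))
    angles = np.outer(np.arange(max_seq, dtype=np.float64), theta)
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin, offset=0):
    seq, d = x.shape
    x1, x2 = x[:, :d // 2], x[:, d // 2:]
    c, s = cos[offset:offset + seq], sin[offset:offset + seq]
    return np.concatenate([x1 * c - x2 * s, x1 * s + x2 * c], axis=-1)

cos, sin = precompute_rope(d=8, max_seq=64)
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))
# (query at 3, key at 7) and (query at 10, key at 14): both distance 4,
# so the scores must be identical.
s1 = float(apply_rope(q, cos, sin, offset=3) @ apply_rope(k, cos, sin, offset=7).T)
s2 = float(apply_rope(q, cos, sin, offset=10) @ apply_rope(k, cos, sin, offset=14).T)
assert np.isclose(s1, s2)
```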
§124.11 Layer normalization (RMSNorm)
Problem statement. Implement RMSNorm — the variant of layer normalization used in Llama, Mistral, and most modern decoder-only models. Accept a learned scale vector gamma.
Why they’re asking. The distinction between LayerNorm and RMSNorm matters: RMSNorm drops the mean-centering and bias terms, reducing compute. Interviewers ask it to see if the candidate knows why the mean is dropped (empirically, centering adds cost without accuracy gain in large models) and whether they can derive the formula from definition.
import numpy as np
def rms_norm(
x: np.ndarray, # [..., d_model]
gamma: np.ndarray, # [d_model], learned scale
eps: float = 1e-6,
) -> np.ndarray:
"""RMSNorm: normalize by root-mean-square, then scale by gamma."""
# RMS over the last dimension
rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
return (x / rms) * gamma
def layer_norm(
x: np.ndarray, # [..., d_model]
gamma: np.ndarray, # [d_model]
beta: np.ndarray, # [d_model], bias (not in RMSNorm)
eps: float = 1e-5,
) -> np.ndarray:
"""Standard LayerNorm for reference."""
mean = x.mean(axis=-1, keepdims=True)
var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
return gamma * (x - mean) / np.sqrt(var + eps) + beta
The common bugs.
- Adding eps after the square root — np.sqrt((x ** 2).mean(axis=-1, keepdims=True)) + eps — instead of inside it, np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps). The standard formulation puts eps under the root; the two differ numerically near zero.
- Applying gamma as a matrix multiply instead of element-wise: it’s a scale, not a projection.
- Forgetting to set eps — without it, a zero-activation vector causes division by zero.
Stretch follow-ups.
- “Why does pre-norm (norm before attention) train more stably than post-norm?” Gradient scale is bounded at each residual branch entry; post-norm can cause very deep models to have vanishing gradients in early layers.
- “What is the memory cost of storing norms vs activations in the backward pass?” Norms need to store the pre-norm activations for the backward pass — same cost as any activation checkpoint.
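A quick property check for the RMSNorm above (restated for self-containment): with gamma = 1, every output row should come out with RMS approximately 1, and the eps makes a zero input safe:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)) * 10.0            # large activations
out = rms_norm(x, gamma=np.ones(16))
# gamma = 1 -> unit RMS per row (up to the tiny eps contribution)
assert np.allclose(np.sqrt((out ** 2).mean(axis=-1)), 1.0, atol=1e-3)
# eps keeps an all-zero vector from dividing by zero
assert not np.any(np.isnan(rms_norm(np.zeros((1, 16)), np.ones(16))))
```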
§124.12 Adam / AdamW optimizer step
Problem statement. Implement a single Adam optimizer step from scratch. Then extend it to AdamW (decoupled weight decay).
Why they’re asking. Optimizer state is the primary memory cost in training after the model weights. The interviewer wants to see: (a) the bias-correction terms, (b) the eps in the right place (denominator, not inside the square root), and (c) why AdamW differs from L2 regularization.
import numpy as np
from typing import Dict
class AdamW:
"""AdamW optimizer — one parameter group."""
def __init__(
self,
lr: float = 1e-3,
beta1: float = 0.9,
beta2: float = 0.999,
eps: float = 1e-8,
weight_decay: float = 0.01,
):
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
self.wd = weight_decay
self.m: Dict[str, np.ndarray] = {} # first moment
self.v: Dict[str, np.ndarray] = {} # second moment
self.t = 0
def step(self, params: Dict[str, np.ndarray], grads: Dict[str, np.ndarray]):
self.t += 1
b1, b2 = self.beta1, self.beta2
for name, p in params.items():
g = grads[name]
if name not in self.m:
self.m[name] = np.zeros_like(p)
self.v[name] = np.zeros_like(p)
# Update biased first and second moment estimates
self.m[name] = b1 * self.m[name] + (1 - b1) * g
self.v[name] = b2 * self.v[name] + (1 - b2) * g ** 2
# Bias-corrected estimates
m_hat = self.m[name] / (1 - b1 ** self.t)
v_hat = self.v[name] / (1 - b2 ** self.t)
# AdamW: apply weight decay directly to weights (not via gradient)
p -= self.lr * self.wd * p
# Adam update
p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
The common bugs.
- Missing bias correction — in early steps, m and v are both biased toward zero, which dramatically under-estimates the gradient signal.
- Putting eps inside the square root: np.sqrt(v_hat + eps) instead of np.sqrt(v_hat) + eps. The two are numerically different; the PyTorch convention is eps outside the root.
- Implementing L2 regularization (adding wd * p to the gradient) instead of AdamW (applying wd directly to the weight). L2 interacts with the adaptive step size; AdamW does not.
Stretch follow-ups.
- “What is the memory cost of Adam vs SGD for a 70B model?” Adam stores m and v — 2× the parameter count. For 70B fp32 parameters: 70B × 2 × 4 bytes ≈ 560 GB of optimizer state alone (≈ 840 GB including the fp32 weights themselves).
- “What is Muon / Shampoo?” Second-order or matrix-preconditioned optimizers that estimate curvature across the full parameter matrix, not per-element.
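The bias-correction bullet can be verified with step-1 arithmetic: a constant gradient of 1.0 leaves the raw EMAs near their zero init, and the correction factor recovers the true moments exactly:

```python
import numpy as np

beta1, beta2, g = 0.9, 0.999, 1.0
m = (1 - beta1) * g            # EMA after step 1, initialized at 0: only 0.1
v = (1 - beta2) * g ** 2       # 0.001
m_hat = m / (1 - beta1 ** 1)   # divide by 0.1 -> recovers 1.0
v_hat = v / (1 - beta2 ** 1)   # divide by 0.001 -> recovers 1.0
assert np.isclose(m_hat, 1.0) and np.isclose(v_hat, 1.0)
```

As t grows, the correction factors 1 - beta**t approach 1 and the correction fades out, which is why the bug only shows up as bad behavior in the first few hundred steps.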
§124.13 Gradient accumulation wrapper
Problem statement. Implement a gradient accumulation wrapper. Given a model with .forward(x), .backward(loss), and a .params dict with associated .grads, accumulate gradients over n_steps mini-batches before applying one optimizer step.
Why they’re asking. Gradient accumulation is the standard technique for training with effective batch sizes that exceed GPU memory. The interviewer is checking: (a) the candidate zeroes gradients before accumulation starts, (b) the loss is scaled by 1/n_steps so the average matches a true large-batch step, and (c) the optimizer step fires exactly once per accumulation window.
class GradAccumulator:
"""Accumulate gradients over n_steps before one optimizer step."""
def __init__(self, model, optimizer, n_steps: int):
self.model = model
self.optimizer = optimizer
self.n = n_steps
self._step = 0
self._zero_grads()
def _zero_grads(self):
for name in self.model.params:
self.model.grads[name] = 0.0
def step(self, x, y):
"""Returns loss (scaled). Applies optimizer every n_steps calls."""
logits = self.model.forward(x)
loss = self.model.loss_fn(logits, y)
# Scale loss so accumulated gradient == gradient of average loss
scaled_loss = loss / self.n
self.model.backward(scaled_loss)
self._step += 1
if self._step % self.n == 0:
self.optimizer.step(self.model.params, self.model.grads)
self._zero_grads()
return float(loss)
The common bugs.
- Forgetting to zero gradients at the start of each accumulation window — gradients pile up across windows, producing incorrect updates.
- Not scaling the loss by 1/n_steps — the accumulated gradient becomes n times the correct value, which destabilizes training.
- Calling the optimizer step on every mini-batch — gradient accumulation’s entire purpose is to fire the optimizer once per window.
Stretch follow-ups.
- “How does gradient accumulation interact with batch normalization?” BatchNorm statistics are computed per micro-batch, not per accumulation window, so accumulation does not reproduce true large-batch BatchNorm behavior. Transformer stacks avoid the issue by using LayerNorm.
- “What is the equivalent in PyTorch?” loss.backward() accumulates into .grad by default; call optimizer.zero_grad() only at the start of the window and optimizer.step() only at the end.
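The 1/n_steps scaling claim can be checked with a toy least-squares gradient (all names here are illustrative, not part of the wrapper above): summing per-micro-batch gradients each scaled by 1/n reproduces the full-batch gradient exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = rng.normal(size=8)
w = 0.5

def grad(xb, yb, w):
    # d/dw of mean((w*x - y)**2) = mean(2 * (w*x - y) * x)
    return (2 * (w * xb - yb) * xb).mean()

full_batch = grad(x, y, w)
# 4 micro-batches of 2, each gradient scaled by 1/4 before accumulating
accumulated = sum(grad(x[i:i + 2], y[i:i + 2], w) / 4 for i in range(0, 8, 2))
assert np.isclose(full_batch, accumulated)
```

Without the 1/4 factor the accumulated value would be 4× too large — exactly the second bug above.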
§124.14 Brute-force vector search with cosine similarity
Problem statement. Given a matrix of stored vectors index (shape [n, d]) and a query vector q (shape [d]), return the indices of the top-k most similar vectors by cosine similarity.
Why they’re asking. Brute-force search is the correctness baseline every ANN index is tested against. The interviewer wants: (a) correct L2 normalization (cosine = dot product of unit vectors), (b) efficient top-k selection with np.argpartition (not sorting the full array just to take k), and (c) awareness of when brute force is the right choice (n < 10k, high-frequency updates).
```python
import numpy as np

def cosine_top_k(
    index: np.ndarray,  # [n, d]
    q: np.ndarray,      # [d]
    k: int,
) -> np.ndarray:        # [k] indices, best first
    # Normalize rows and query so cosine = dot product of unit vectors
    index_norm = index / (np.linalg.norm(index, axis=1, keepdims=True) + 1e-8)
    q_norm = q / (np.linalg.norm(q) + 1e-8)
    scores = index_norm @ q_norm  # [n]
    # argpartition to find top-k in O(n), then sort just those k
    if k >= len(scores):
        top_k_idx = np.arange(len(scores))
    else:
        top_k_idx = np.argpartition(scores, -k)[-k:]
    # Sort the k candidates descending by score
    return top_k_idx[np.argsort(scores[top_k_idx])[::-1]]
```
The common bugs.
- Using `np.argsort(scores)[-k:]` — this sorts all `n` scores; `np.argpartition` is O(n) vs O(n log n) and a visible win for large `n`.
- Not normalizing before the dot product — the unnormalized dot product is the inner product, not cosine similarity.
- Division by zero for zero-norm vectors — adding a small `eps` to the denominator handles this.
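The argpartition-vs-argsort point can be sanity-checked in isolation with a standalone snippet (independent of the function above): both paths must return the same top-k indices, in the same order, on distinct scores.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal(10_000)
k = 10

# O(n log n): sort everything, take the k largest
by_sort = np.argsort(scores)[::-1][:k]

# O(n) selection with argpartition, then sort only the k survivors
cand = np.argpartition(scores, -k)[-k:]
by_partition = cand[np.argsort(scores[cand])[::-1]]

assert np.array_equal(by_sort, by_partition)
```

With ties in `scores` the two index arrays can legally differ (same scores, different indices); random floats sidestep that here.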
Stretch follow-ups.
- “What is the complexity, and when does it become unacceptable?” O(n × d) for the matmul. Acceptable up to roughly n = 100k at d = 768 on a modern CPU. Beyond that, use HNSW or FAISS IVF.
- “How do you handle vectors that were added after the initial build?” Brute-force handles it trivially by appending to the index matrix. HNSW requires a graph insert; IVF requires coarse assignment.
§124.15 HNSW search
Problem statement. Given a pre-built HNSW graph (a dict from node ID to dict of layer → list of neighbor IDs), and a distance function, implement the greedy search from the entry point down to layer 0, returning the top-k nearest neighbors.
Why they’re asking. HNSW is the dominant algorithm behind every production vector index (FAISS, Qdrant, Weaviate, pgvector). The interview tests whether the candidate understands layered structure (greedy descent at upper layers, beam-expanded search at layer 0) — not just that “HNSW is fast.”
```python
import heapq
from typing import Dict, List

import numpy as np

def hnsw_search(
    graph: Dict[int, Dict[int, List[int]]],  # node -> layer -> [neighbor_ids]
    vectors: np.ndarray,                     # [n, d] all node vectors
    query: np.ndarray,                       # [d]
    entry_point: int,
    num_layers: int,
    ef: int,                                 # search width at layer 0
    k: int,
) -> List[int]:
    """Greedy HNSW search. Returns k nearest neighbor IDs."""

    def dist(a_id: int) -> float:
        return float(np.linalg.norm(vectors[a_id] - query))

    current = entry_point
    # Phase 1: greedy descent through upper layers (L down to 1)
    for layer in range(num_layers, 0, -1):
        improved = True
        while improved:
            improved = False
            for nb in graph.get(current, {}).get(layer, []):
                if dist(nb) < dist(current):
                    current = nb
                    improved = True

    # Phase 2: best-first search at layer 0, keeping up to ef results
    candidates = [(dist(current), current)]  # min-heap by distance
    visited = {current}
    result = [(dist(current), current)]      # at most ef best found so far
    while candidates:
        d_c, c = heapq.heappop(candidates)
        # Prune: once the result set is full, a candidate worse than the
        # worst result cannot improve anything reachable from it
        if d_c > max(result)[0] and len(result) >= ef:
            break
        for nb in graph.get(c, {}).get(0, []):
            if nb not in visited:
                visited.add(nb)
                d_nb = dist(nb)
                if d_nb < max(result)[0] or len(result) < ef:
                    heapq.heappush(candidates, (d_nb, nb))
                    result.append((d_nb, nb))
                    if len(result) > ef:
                        result.remove(max(result))  # drop the current worst
    result.sort()
    return [node for _, node in result[:k]]
```
The common bugs.
- Running the beam-expanded search at every layer — the whole point of upper layers is a single greedy hop per node.
- Pruning the layer-0 search too aggressively by setting `ef = k` — `ef` should be larger than `k` (typically 2-10×) to improve recall.
- Not tracking `visited` — without it, the graph walk can revisit nodes exponentially.
Stretch follow-ups.
- “What are `M` and `ef_construction`?” `M` is the number of neighbors per node during graph construction; `ef_construction` is the search width used during insert. Both trade index build time and memory against recall.
- “How does Qdrant use HNSW differently from FAISS HNSW?” Qdrant exposes per-collection quantization and per-segment HNSW graphs; FAISS treats the whole index as one flat structure.
§124.16 BM25 scorer
Problem statement. Implement BM25 scoring. Given a collection of documents and a query, score all documents against the query.
Why they’re asking. BM25 is the statistical backbone of hybrid search. Every RAG system that combines dense and sparse retrieval uses BM25 as the sparse component. The interviewer wants to see: (a) the IDF formula (with or without the +1 smoothing — note which one), (b) the TF saturation term (k1 + 1) in the numerator, and (c) the length normalization.
```python
import math
from collections import defaultdict
from typing import Dict, List

def build_bm25_index(docs: List[List[str]]) -> Dict:
    """docs: list of tokenized documents (list of word strings)."""
    N = len(docs)
    df: Dict[str, int] = defaultdict(int)  # document frequency
    tf: List[Dict[str, int]] = []          # term frequency per doc
    lengths = []
    for doc in docs:
        lengths.append(len(doc))
        tf_doc: Dict[str, int] = defaultdict(int)
        for term in doc:
            tf_doc[term] += 1
        tf.append(tf_doc)
        for term in set(doc):
            df[term] += 1
    avg_dl = sum(lengths) / N
    return {"N": N, "df": df, "tf": tf, "lengths": lengths, "avg_dl": avg_dl}
```
```python
def bm25_score(
    idx: Dict,
    query_terms: List[str],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Return the BM25 score of every document for the query."""
    N, df, tf_list = idx["N"], idx["df"], idx["tf"]
    avg_dl, lengths = idx["avg_dl"], idx["lengths"]
    scores = []
    for i, tf_doc in enumerate(tf_list):
        score = 0.0
        dl = lengths[i]
        for term in query_terms:
            if term not in df:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            tf_val = tf_doc.get(term, 0)
            tf_norm = tf_val * (k1 + 1) / (tf_val + k1 * (1 - b + b * dl / avg_dl))
            score += idf * tf_norm
        scores.append(score)
    return scores
```
The common bugs.
- Using `log(N / df)` instead of the Robertson–Spärck Jones IDF with the `+0.5` smoothing — the smoothed version handles terms that appear in every document without going to zero.
- Forgetting the `(k1 + 1)` in the numerator of the TF term — without it, the saturation behavior is wrong.
- Using raw TF instead of normalized TF — raw TF rewards longer documents; the `(1 - b + b * dl / avg_dl)` term is the length normalization.
Stretch follow-ups.
- “How do you combine BM25 and dense cosine scores?” Reciprocal Rank Fusion (RRF) or linear combination with learned weights. RRF is score-scale-agnostic, which makes it robust.
- “What is BM25F?” An extension that applies different field weights and length normalizations to structured fields (title, body, metadata).
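The Reciprocal Rank Fusion mentioned in the follow-up takes only a few lines. This is a minimal sketch, assuming each input is a ranked list of document IDs (best first) and using the conventional `k = 60` constant from the RRF paper; `rrf_fuse` is an illustrative name, not a library function:

```python
from collections import defaultdict
from typing import List

def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """RRF: score(d) = sum over rankings of 1 / (k + rank_of_d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([["a", "b", "c"], ["b", "c", "a"]])  # -> ["b", "a", "c"]
```

Because only ranks enter the formula, the BM25 and cosine score scales never need to be reconciled — exactly the robustness property noted above.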
§124.17 Token-bucket rate limiter
Problem statement. Implement a token-bucket rate limiter. Support capacity tokens, a refill rate of rate tokens per second, and a consume(n) method that returns True if the request is allowed.
Why they’re asking. Rate limiting is a first-class concern in LLM APIs — tokens per minute, requests per second, per-user budgets. The interview checks: (a) monotonic clock usage (not time.time in production but acceptable here), (b) refill-on-demand (lazy computation instead of a background thread), and (c) correct clamping to capacity.
```python
import time

class TokenBucket:
    """Token-bucket rate limiter. A thread-safe variant would add a lock."""

    def __init__(self, capacity: float, rate: float):
        """capacity: max tokens. rate: tokens refilled per second."""
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)  # start full
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        # Add tokens earned since the last call, but no more than capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def consume(self, n: float = 1.0) -> bool:
        """Returns True if n tokens are available (and consumes them)."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```
The common bugs.
- Using a background thread to refill — unnecessary; lazy refill on every `consume` is correct and simpler.
- Using `time.time()` — the wall clock can go backwards (NTP corrections). `time.monotonic()` is the correct choice.
- Not clamping to `capacity` during refill — uncapped refill lets tokens accumulate across idle periods, creating burst spikes larger than the intended capacity.
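The burst-then-throttle behavior is easiest to verify deterministically by injecting a fake clock, a testing pattern rather than part of the class above. `FakeClock` and `Bucket` here are hypothetical helpers restated minimally so the snippet is self-contained:

```python
class FakeClock:
    """Deterministic stand-in for time.monotonic() (testing helper)."""
    def __init__(self):
        self.t = 0.0
    def monotonic(self):
        return self.t
    def advance(self, dt):
        self.t += dt

class Bucket:
    """Minimal restatement of the token bucket with an injectable clock."""
    def __init__(self, capacity, rate, clock):
        self.capacity, self.rate, self.clock = capacity, rate, clock
        self.tokens = float(capacity)
        self.last = clock.monotonic()
    def consume(self, n=1.0):
        now = self.clock.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

clock = FakeClock()
bucket = Bucket(capacity=5, rate=1.0, clock=clock)
burst = [bucket.consume() for _ in range(6)]
assert burst == [True] * 5 + [False]  # burst up to capacity, then denied
clock.advance(2.0)                    # 2 s at rate 1.0 -> 2 tokens back
assert [bucket.consume() for _ in range(3)] == [True, True, False]
```

Making the clock a constructor argument is also how you would unit-test the production class without `time.sleep`.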
Stretch follow-ups.
- “How do you make this distributed (across multiple API gateway instances)?” Use Redis `INCR` + `EXPIRE` for a fixed-window counter, or use Redis with a Lua script that atomically checks and decrements the bucket.
- “What is the difference between token bucket and leaky bucket?” Token bucket allows bursts up to capacity; leaky bucket enforces a strictly constant output rate.
§124.18 LRU cache
Problem statement. Implement an LRU cache with a fixed capacity. Support get(key) (return None if missing) and put(key, value).
Why they’re asking. LRU cache is the classic interview question, but in ML systems interviews it shows up concretely: embedding caches, KV prefix caches, prompt caches. The interviewer wants OrderedDict used fluently — the candidate who reaches for a doubly-linked-list + hashmap is doing more work than necessary.
```python
from collections import OrderedDict
from typing import Any, Optional

class LRUCache:
    """O(1) get and put using OrderedDict's move_to_end."""

    def __init__(self, capacity: int):
        self.cap = capacity
        self.cache: OrderedDict = OrderedDict()

    def get(self, key: Any) -> Optional[Any]:
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # mark as recently used
        return self.cache[key]

    def put(self, key: Any, value: Any):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.cap:
            self.cache.popitem(last=False)  # evict LRU (head of OrderedDict)
```
The common bugs.
- Forgetting `move_to_end` on a `get` hit — the cache becomes FIFO, not LRU.
- Using `popitem(last=True)` for eviction — this evicts the MRU (most recent), the opposite of the intended behavior.
- Not handling the case where a key being `put` already exists — without `move_to_end` before the assignment, the existing key’s order does not update.
Stretch follow-ups.
- “How do you make this thread-safe?” Wrap `get` and `put` with `threading.Lock()`. For high contention, use a striped lock (sharded by key hash).
- “How does vLLM’s prefix cache differ?” It’s block-granular — cache entries are blocks of KV cache for a prefix hash, not individual embeddings. Eviction is by LRU but operates on blocks.
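The locked variant from the follow-up can be sketched directly. This is a minimal single-coarse-lock version (`ThreadSafeLRU` is an illustrative name); a striped-lock variant would hold one of `n` locks chosen by `hash(key) % n` instead:

```python
import threading
from collections import OrderedDict
from typing import Any, Optional

class ThreadSafeLRU:
    """LRU cache guarded by one coarse lock."""

    def __init__(self, capacity: int):
        self.cap = capacity
        self.cache: OrderedDict = OrderedDict()
        self.lock = threading.Lock()

    def get(self, key: Any) -> Optional[Any]:
        with self.lock:
            if key not in self.cache:
                return None
            self.cache.move_to_end(key)
            return self.cache[key]

    def put(self, key: Any, value: Any):
        with self.lock:
            if key in self.cache:
                self.cache.move_to_end(key)
            self.cache[key] = value
            if len(self.cache) > self.cap:
                self.cache.popitem(last=False)

cache = ThreadSafeLRU(2)
cache.put("a", 1)
cache.put("b", 2)
assert cache.get("a") == 1   # "a" is now most recently used
cache.put("c", 3)            # evicts "b", the least recently used
assert cache.get("b") is None
```

The lock must cover the lookup, the `move_to_end`, and the eviction together; locking each dict operation separately still races.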
§124.19 Speculative decoding verification loop
Problem statement. Implement the verification loop for speculative decoding. Given: draft_tokens (list of K token IDs proposed by the draft model), target_logits (shape [K+1, vocab_size] — the target model’s logits for each draft token position plus one more), return the accepted prefix and, if a rejection occurred, a resampled replacement token.
Why they’re asking. Speculative decoding is the inference speedup technique interviewers most frequently probe in depth. The verification loop is the algorithmic core. The candidate must show: (a) the rejection sampling criterion (accept token t if target_prob[t] / draft_prob[t] >= Uniform(0,1)), (b) the adjusted distribution on rejection, and (c) why all K tokens can be verified in a single target model forward pass.
```python
import numpy as np
from typing import List, Tuple

def verify_speculative(
    draft_tokens: List[int],    # [K] token IDs from draft model
    draft_probs: np.ndarray,    # [K, V] — draft prob for each position
    target_logits: np.ndarray,  # [K+1, V] — target model output
) -> Tuple[List[int], int]:
    """
    Returns (accepted_tokens, bonus_token).
    accepted_tokens: the prefix of draft_tokens the target accepts (may be empty).
    bonus_token: one additional token sampled from the adjusted distribution.
    """
    K, V = draft_probs.shape

    def softmax(x):
        x = x - x.max()  # max-subtract for numerical stability
        e = np.exp(x)
        return e / e.sum()

    target_probs = np.array([softmax(target_logits[i]) for i in range(K + 1)])  # [K+1, V]
    accepted = []
    for i, t in enumerate(draft_tokens):
        p_target = target_probs[i, t]
        p_draft = draft_probs[i, t]
        # Accept with probability min(1, target/draft)
        r = np.random.uniform()
        if r <= p_target / (p_draft + 1e-9):
            accepted.append(t)
        else:
            # Rejection: resample from the residual distribution.
            # Its mass is strictly positive whenever rejection can occur.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            bonus = int(np.random.choice(V, p=residual))
            return accepted, bonus
    # All K tokens accepted; also sample the bonus token from position K
    bonus = int(np.random.choice(V, p=target_probs[K]))
    return accepted, bonus
```
The common bugs.
- Rejecting based on `target_prob < draft_prob` (a threshold comparison) instead of accepting with probability `min(1, target/draft)` — the ratio test is the mathematically correct criterion that preserves the target distribution.
- Not computing the residual distribution on rejection — sampling the replacement directly from `target_probs[i]` does not preserve the target distribution over the combined accept-or-replace process; the renormalized residual `max(p_target - p_draft, 0)` is what makes the scheme lossless.
- Using `K` target logits instead of `K+1` — the bonus token at position K requires a logit beyond the last draft token.
Stretch follow-ups.
- “What acceptance rate makes speculative decoding worthwhile?” If each draft token is accepted independently with probability `α`, the expected tokens emitted per target forward pass is `(1 - α^(K+1)) / (1 - α)`; the linear estimate `K·α + 1` overstates it. Speculation wins when that expectation, times the target’s per-pass cost, exceeds the cost of K draft passes plus one target pass.
- “What is Medusa?” It replaces the draft model with multiple independent prediction heads on the same model — eliminating the separate draft model at the cost of slightly lower acceptance rates.
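Under the independence assumption (each of the K draft tokens accepted with probability α, plus one bonus token always emitted), the expected tokens per target pass has a closed form that a quick Monte Carlo check confirms. `expected_tokens` and `simulate` are illustrative helpers, not from the chapter:

```python
import random

def expected_tokens(alpha: float, K: int) -> float:
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^K:
    # expected accepted prefix length plus the always-emitted bonus token.
    if alpha == 1.0:
        return K + 1.0
    return (1.0 - alpha ** (K + 1)) / (1.0 - alpha)

def simulate(alpha: float, K: int, trials: int = 200_000) -> float:
    """Monte Carlo estimate of tokens emitted per target pass."""
    rng = random.Random(0)
    total = 0
    for _ in range(trials):
        emitted = 1  # the bonus token is always emitted
        for _ in range(K):
            if rng.random() < alpha:
                emitted += 1
            else:
                break  # first rejection ends the accepted prefix
        total += emitted
    return total / trials

# e.g. alpha=0.7, K=4: closed form ~2.773; the simulation agrees closely
```

The gap between the geometric sum and the linear `K·α + 1` estimate grows with K, which is why very long draft windows pay off less than the linear estimate suggests.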
§124.20 Naive ring all-reduce
Problem statement. Implement a simple ring all-reduce for gradient synchronization across n workers. Each worker has a numpy array of gradients. After the all-reduce, every worker should hold the sum of all workers’ arrays.
Why they’re asking. All-reduce is the communication primitive behind every data-parallel training job. The ring algorithm’s per-worker bandwidth cost is roughly 2(n-1)/n × d elements (effectively independent of n), versus O(n × d) through a central parameter server, and it needs only O(1) additional memory. The interviewer wants: (a) the two-phase ring structure (reduce-scatter then all-gather), and (b) awareness that this is a simulation — real all-reduce uses NCCL, which pipelines across chunks.
```python
import numpy as np
from typing import List

def ring_all_reduce(workers: List[np.ndarray]) -> List[np.ndarray]:
    """
    Naive ring all-reduce (in-process simulation).
    workers: list of n arrays, each shape [d]. Returns list of n arrays,
    each containing the element-wise sum.
    Two phases:
      1. Reduce-scatter: partial sums flow around the ring; after n-1
         rounds, worker i holds the fully reduced shard (i+1) % n.
      2. All-gather: each fully reduced shard circulates to every worker.
    """
    n = len(workers)
    d = workers[0].shape[0]
    assert all(w.shape == (d,) for w in workers)
    chunk = d // n  # assume d divisible by n for simplicity
    # Working copies so we don't mutate the inputs
    bufs = [w.copy() for w in workers]

    # --- Phase 1: reduce-scatter (n-1 rounds) ---
    for step in range(n - 1):
        for i in range(n):
            src = (i - 1) % n           # left neighbor sends to me
            shard = (i - 1 - step) % n  # shard flowing into worker i this round
            s, e = shard * chunk, (shard + 1) * chunk
            bufs[i][s:e] += bufs[src][s:e]  # accumulate the partial sum

    # --- Phase 2: all-gather (n-1 rounds) ---
    for step in range(n - 1):
        for i in range(n):
            src = (i - 1) % n
            shard = (i - step) % n      # fully reduced shard held by src
            s, e = shard * chunk, (shard + 1) * chunk
            bufs[i][s:e] = bufs[src][s:e]  # copy the fully reduced shard
    return bufs
```
```python
# Verify: every worker should hold the element-wise sum
np.random.seed(0)
workers = [np.random.randn(8) for _ in range(4)]
expected_sum = sum(workers)
for r in ring_all_reduce(workers):
    assert np.allclose(r, expected_sum, atol=1e-6), "all-reduce incorrect"
```
The common bugs.
- Running only one phase — reduce-scatter alone gives each worker one correct shard, not the full sum; all-gather is required.
- Wrong ring direction: both phases must use the same direction (left or right); mixing them produces garbage.
- Mutating the input `workers` list during accumulation — always work on copies so each step reads clean source values.
Stretch follow-ups.
- “Why is ring all-reduce preferred over a parameter server?” A parameter server creates O(n) traffic on the server; ring all-reduce distributes traffic evenly across all workers, saturating the interconnect equally.
- “What does NCCL add on top of this?” Hardware-aware chunk pipelining (overlaps compute and communication), NVLink and InfiniBand topology awareness, and kernels that operate directly in GPU HBM without CPU involvement.
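The parameter-server comparison is worth quantifying. A back-of-envelope helper, assuming fp32 elements, counting only payload bytes sent, and ignoring latency and pipelining (function names are illustrative):

```python
def ring_bytes_per_worker(n: int, d: int, bytes_per_elem: int = 4) -> float:
    """Each worker sends (n-1)/n of the array in each of the two phases."""
    return 2 * (n - 1) / n * d * bytes_per_elem

def param_server_bytes_at_server(n: int, d: int, bytes_per_elem: int = 4) -> int:
    """The server receives n full gradients and sends n full updates."""
    return 2 * n * d * bytes_per_elem

# 1B fp32 parameters, 8 workers:
# ring: ~7 GB sent per worker, spread evenly across all links;
# parameter server: 64 GB funneled through one machine's NIC.
```

The per-worker ring cost barely changes as n grows, while the server's traffic grows linearly — the asymmetry behind the preference for ring all-reduce.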
Read it yourself
- Andrej Karpathy, nanoGPT (GitHub). The cleanest single-file reference for attention, multi-head, cross-entropy, and the autoregressive loop — reads in under an hour.
- Sebastian Raschka, Build a Large Language Model (From Scratch). Chapter-by-chapter Python implementations of every algorithm in this chapter.
- The BPE paper: Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units” (2016) — the original merge-table algorithm.
- Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models” (2021) — the B=0 initialization is in Section 4.1.
- Leviathan et al., “Fast Inference from Transformers via Speculative Decoding” (2023) — the acceptance criterion derivation is in Section 2.3.
- Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding” (2021) — the pair-wise rotation derivation from first principles.
- The HNSW paper: Malkov & Yashunin, “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs” (2018) — search algorithm in Algorithm 5.
Practice
- Implement all twenty algorithms from memory, one per day, without reference. Time yourself; target under 15 minutes per algorithm.
- For each algorithm, write one unit test that catches the most common bug listed in its section. Run it before you consider the implementation done.
- Pick five algorithms and explain, out loud, the numerical or algorithmic trap that makes the naive implementation wrong. Practice until the explanation takes under 60 seconds.
- Chain §124.7 (KV cache) and §124.2 (attention) into a working decode loop: given a sequence of embeddings, produce one new token per step using the cached KV.
- Combine §124.14 (brute-force cosine search) and §124.16 (BM25) into a simple hybrid scorer using reciprocal rank fusion. Test that the fused scores differ from either alone.
- Stretch: Implement §124.20 (ring all-reduce) with `n=8` workers and verify the result against a simple `sum(workers)`. Then replace the in-process simulation with socket-based communication between threads.
- Stretch: Take §124.19 (speculative decoding) and wire it to a real HuggingFace model as the target and a smaller model as the draft. Measure the wall-clock speedup as a function of `K` and the acceptance rate.