# Formula reference

Every formula from the book on one page. Pull this up during an interview.

## Attention & Transformer

| Name | Formula | Note |
| --- | --- | --- |
| Attention scores | QK^T / sqrt(d_k) | Scaled dot-product. Dividing by sqrt(d_k) keeps score variance near 1. |
| Attention output | softmax(QK^T / sqrt(d_k)) V | Weighted sum of values. |
| Score matrix size | O(s^2) entries | s = sequence length. This is why FlashAttention matters. |
| Multi-head split | d_h = d_model / H | H heads, each sees d_h dimensions. Total compute unchanged. |
| Transformer FLOPs/token | ~2N (forward), ~6N (fwd + bwd) | N = parameter count. Rule of thumb. |
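The two attention formulas above can be checked with a minimal NumPy sketch (shapes and seed are illustrative, not from the book):

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, exactly as in the table above
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (s, s) score matrix: O(s^2) entries
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax, numerically stable
    return w @ V                                     # weighted sum of values

rng = np.random.default_rng(0)
s, d_k = 4, 8
Q, K, V = (rng.normal(size=(s, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note the intermediate `(s, s)` matrix: for s = 128K that is 16 billion scores per head, which is the memory-traffic problem FlashAttention avoids by never materializing it.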

## KV Cache

| Name | Formula | Note |
| --- | --- | --- |
| KV cache per token | 2 * n_layers * n_kv_heads * d_head * bytes_per_param | 2 for K and V. Llama 3 70B = 320 KB/token in bf16. |
| KV cache total | per_token * seq_len * batch_size | Dominates HBM at any serious concurrency. |
| GQA reduction factor | n_q_heads / n_kv_heads | Llama 3 70B: 64/8 = 8x smaller KV cache vs MHA. |
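A quick sanity check of the per-token formula, plugging in the Llama 3 70B shape (80 layers, 8 KV heads under GQA, head dim 128, 2 bytes for bf16):

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_param):
    # 2x for K and V, stored per layer, per KV head
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_param

# Llama 3 70B in bf16
per_token = kv_cache_bytes_per_token(80, 8, 128, 2)
print(per_token // 1024)  # 320  (KB/token, matching the table)
```

At 8K context and batch 64, that is 320 KB x 8192 x 64 ≈ 160 GB of KV cache — larger than the weights themselves.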

## Inference Performance

| Name | Formula | Note |
| --- | --- | --- |
| Prefill arithmetic intensity | FLOPs / bytes_loaded ~ 2 * d_model | Compute-bound. GPU does useful work. |
| Decode arithmetic intensity | ~1 | One token through all weights. Memory-bandwidth-bound; GPU mostly waits for HBM. |
| TTFT | queue_delay + input_tokens / prefill_rate | Time to first token. User-perceived latency. |
| TPOT | 1 / decode_rate | Time per output token. ~12-25 ms for 70B on H100. |
| End-to-end latency | TTFT + output_tokens * TPOT | Decode dominates for long outputs. |
| Model weight memory | params * bytes_per_param | 70B bf16 = 140 GB. INT4 = 35 GB. |
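The latency formulas compose directly. A sketch with hypothetical numbers (2,000-token prompt, 10,000 tok/s prefill, 50 ms queueing, 500 output tokens at 20 ms TPOT — none of these are from the book):

```python
def ttft(queue_delay_s, input_tokens, prefill_rate_tok_s):
    # Time to first token: queueing plus prefill of the whole prompt
    return queue_delay_s + input_tokens / prefill_rate_tok_s

def e2e_latency(ttft_s, output_tokens, tpot_s):
    # End-to-end: first token, then one TPOT per remaining output token
    return ttft_s + output_tokens * tpot_s

first_token = ttft(0.05, 2000, 10_000)        # 0.25 s
total = e2e_latency(first_token, 500, 0.020)  # 10.25 s
```

Here decode is 10 s of the 10.25 s total — the "decode dominates for long outputs" row in numbers.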

## Cost & Capacity

| Name | Formula | Note |
| --- | --- | --- |
| Cost per 1M tokens | GPU_cost_per_hour / tokens_per_hour * 1M | Utilization is the lever that moves cost 2-5x. |
| Chinchilla optimal | ~20 tokens per parameter | Compute-optimal. Modern frontier models over-train 5-10x. |
| Little's Law | L = lambda * W | Concurrent requests = arrival rate * average latency. |
| Compound availability | A_total = A1 * A2 * ... * An (serial) | Four 99.9% dependencies = 99.6% ceiling. |
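Little's Law and compound availability are one-liners worth internalizing; the arrival rate and latency below are made-up illustrative values:

```python
import math

def concurrency(arrival_rate_rps, avg_latency_s):
    # Little's Law: L = lambda * W
    return arrival_rate_rps * avg_latency_s

def serial_availability(deps):
    # Serial dependencies multiply: A_total = A1 * A2 * ... * An
    return math.prod(deps)

in_flight = concurrency(50, 4.0)            # 200 concurrent requests to provision for
ceiling = serial_availability([0.999] * 4)  # ~0.996
```

Four nines-ish dependencies silently cap you at roughly 99.6% — no amount of reliability work on your own service raises that ceiling.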

## Training

| Name | Formula | Note |
| --- | --- | --- |
| AdamW memory | 8 bytes/param (m and v in fp32) | First and second moments. 70B model: 560 GB of optimizer state alone. |
| Training memory total | weights + grads + optimizer + activations | ~16-20 bytes/param minimum. 70B needs ~1+ TB. |
| LoRA trainable params | (m + n) * r per adapted layer | r=16 on a 4096x4096 layer: 131K vs 16.7M, a ~128x reduction. |
| QLoRA memory | 4-bit base + bf16 LoRA adapters | Fits a 70B model on a single 48 GB GPU. |
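The LoRA count comes from the shapes of the low-rank pair — a quick check using the table's example layer:

```python
def lora_trainable_params(m, n, r):
    # Down-projection A is (m, r), up-projection B is (r, n): (m + n) * r total
    return (m + n) * r

full = 4096 * 4096                            # 16,777,216 frozen params in the base layer
lora = lora_trainable_params(4096, 4096, 16)  # 131,072 trainable params
print(full // lora)  # 128
```

The reduction factor scales as min(m, n) / r, so smaller ranks buy roughly proportional savings until quality degrades.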

## GPU & Compilation

| Name | Formula | Note |
| --- | --- | --- |
| GPU theoretical FLOPS | SMs x FMAs/SM/clock x clock x 2 (FMA = 2 FLOPs) | H100 SXM bf16: 132 SMs x 2048 FMAs x 1.83 GHz x 2 ≈ 989 TFLOPS dense. |
| Arithmetic intensity | FLOPs / bytes_loaded | Compute-bound if AI > machine balance. Decode AI ~1; prefill AI ~2 * d_model. |
| HBM bandwidth utilization | bytes_moved / (time x peak_BW) x 100% | H100: 3.35 TB/s peak. Most kernels achieve 60-80%. |
| Occupancy | active_warps / max_warps_per_SM | Max 64 warps/SM on H100. Higher != always faster. |
| Fusion speedup | ~N x HBM_round_trips_saved | Fusing 3 elementwise ops saves 2 intermediate HBM reads + 2 writes. |
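The peak-FLOPS and machine-balance arithmetic, assuming ~2048 bf16 tensor-core FMAs per SM per clock at 1.83 GHz (the combination that reproduces the published 989 TFLOPS dense figure):

```python
def peak_tflops(sms, fmas_per_sm_per_clock, clock_ghz):
    # x2 because one fused multiply-add counts as two FLOPs
    return sms * fmas_per_sm_per_clock * clock_ghz * 2 / 1e3

bf16 = peak_tflops(132, 2048, 1.83)       # ~989 TFLOPS dense (H100 SXM)

# Machine balance: peak FLOPS / peak HBM bandwidth (3.35 TB/s)
machine_balance = bf16 * 1e12 / 3.35e12   # ~295 FLOPs/byte
```

Decode, at an arithmetic intensity near 1, sits two orders of magnitude below that ~295 FLOPs/byte balance point — which is the whole memory-bound story in one number.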

## Retrieval

| Name | Formula | Note |
| --- | --- | --- |
| BM25 score | sum(IDF(t) * tf * (k1+1) / (tf + k1 * (1 - b + b * dl/avgdl))) | k1 ~ 1.5, b ~ 0.75. Saturating TF plus length normalization. |
| PQ compression | M subquantizers * log2(K) bits per vector | M=32 subspaces, K=256 centroids = 32 bytes/vector. |
| RRF fusion | score = sum(1 / (k + rank_i)) | k=60. Rank-only; no score normalization needed. |
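RRF is simple enough to write from the formula alone; a minimal sketch fusing two hypothetical ranked lists (doc IDs are invented for illustration):

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d)
    scores = {}
    for ranking in rankings:                       # each list is ordered best-first
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d2"]       # lexical retriever's order
vector_ranking = ["d1", "d2", "d4"]     # embedding retriever's order
fused = rrf_fuse([bm25_ranking, vector_ranking])  # d1 first: high in both lists
```

Because only ranks enter the sum, BM25 scores and cosine similarities never need to be put on a common scale — the main reason RRF is the default fusion choice for hybrid search.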