# Formula reference
Every formula from the book on one page. Pull this up during an interview.
## Attention & Transformer
| Name | Formula | Note |
|---|---|---|
| Attention scores | QK^T / sqrt(d_k) | Scaled dot-product. Dividing by sqrt(d_k) keeps score variance near 1 so softmax doesn't saturate. |
| Attention output | softmax(QK^T / sqrt(d_k)) V | Weighted sum of values. |
| Score matrix size | O(s^2) entries | s = sequence length. This is why FlashAttention matters. |
| Multi-head split | d_h = d_model / H | H heads, each sees d_h dimensions. Total compute unchanged. |
| Transformer FLOPs/token | ~2 * N (forward) or ~6 * N (fwd+bwd) | N = parameter count. Backward is ~2x forward. Rule of thumb. |
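
The attention formulas above fit in a few lines of NumPy. A minimal sketch (unbatched, single head, no masking):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (s, s) score matrix: the O(s^2) term
    # Row-wise softmax, shifted by the row max for numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # weighted sum of values

s, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((s, d_k)) for _ in range(3))
out = attention(Q, K, V)              # shape (s, d_k) = (4, 8)
```

Note that `scores` materializes all s^2 entries, which is exactly the memory traffic FlashAttention avoids by tiling.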
## KV Cache
| Name | Formula | Note |
|---|---|---|
| KV cache per token | 2 * n_layers * n_kv_heads * d_head * bytes | 2 for K and V. Llama 3 70B = 320 KB/token in bf16. |
| KV cache total | per_token * seq_len * batch_size | Dominates HBM at any serious concurrency. |
| GQA reduction factor | n_q_heads / n_kv_heads | Llama 3 70B: 64/8 = 8x smaller KV cache vs MHA. |
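
The per-token and total formulas compose into a one-function calculator. The Llama 3 70B shapes (80 layers, 8 KV heads under GQA, head_dim 128) reproduce the 320 KB/token figure:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, bytes_per_el=2, seq_len=1, batch=1):
    """KV cache bytes: 2 (K and V) * layers * kv_heads * head_dim * dtype size."""
    per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_el
    return per_token * seq_len * batch

# Llama 3 70B in bf16 (2 bytes/element)
per_tok = kv_cache_bytes(80, 8, 128)                        # 327,680 B = 320 KB/token
full = kv_cache_bytes(80, 8, 128, seq_len=8192, batch=32)   # 80 GiB of HBM
```

At 8K context and batch 32 the cache alone is 80 GiB, which is why it "dominates HBM at any serious concurrency."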
## Inference Performance
| Name | Formula | Note |
|---|---|---|
| Prefill arithmetic intensity | FLOPs / bytes_loaded ~ 2 * d_model | Compute-bound. GPU does useful work. |
| Decode arithmetic intensity | ~1 (one token through all weights) | Memory-bandwidth-bound. GPU mostly waits for HBM. |
| TTFT | queue_delay + input_tokens / prefill_rate | Time to first token. User-perceived latency. |
| TPOT | 1 / decode_rate | Time per output token. ~12-25ms for 70B on H100. |
| End-to-end latency | TTFT + (output_tokens * TPOT) | Decode dominates for long outputs. |
| Model weight memory | params * bytes_per_param | 70B bf16 = 140 GB. INT4 = 35 GB. |
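
The TTFT / TPOT / end-to-end formulas chain together directly. A sketch with illustrative numbers (2K-token prompt, 10K tok/s prefill, 15 ms TPOT; these are assumptions, not benchmarks):

```python
def ttft_s(queue_delay_s, input_tokens, prefill_rate_tok_s):
    """Time to first token: queue delay plus prefill time."""
    return queue_delay_s + input_tokens / prefill_rate_tok_s

def e2e_latency_s(ttft, output_tokens, tpot_s):
    """End-to-end latency: TTFT + output_tokens * TPOT."""
    return ttft + output_tokens * tpot_s

t0 = ttft_s(0.05, 2048, 10_000)        # 0.2548 s to first token
total = e2e_latency_s(t0, 512, 0.015)  # decode adds 7.68 s on top
```

Even here, decode is ~30x the prefill time: long outputs make TPOT, not TTFT, the number to optimize.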
## Cost & Capacity
| Name | Formula | Note |
|---|---|---|
| Cost per 1M tokens | GPU_cost_per_hour / tokens_per_hour * 1M | Utilization is the lever that moves cost 2-5x. |
| Chinchilla optimal | ~20 tokens per parameter | Compute-optimal. Modern frontier over-trains 5-10x. |
| Little's Law | L = lambda * W | Concurrent requests = arrival rate * avg latency. |
| Compound availability | A_total = A1 * A2 * ... * An (serial) | Four 99.9% deps = 99.6% ceiling. |
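
Little's Law and compound availability are both one-liners worth sanity-checking with real numbers:

```python
import math

def concurrent_requests(arrival_rate_per_s, avg_latency_s):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * avg_latency_s

def serial_availability(*availabilities):
    """Serial dependencies multiply: A_total = A1 * A2 * ... * An."""
    return math.prod(availabilities)

# 50 req/s arriving, 8 s average end-to-end latency -> 400 requests in flight
L = concurrent_requests(50, 8.0)
# Four 99.9% dependencies in series -> ~99.6% ceiling
A = serial_availability(0.999, 0.999, 0.999, 0.999)
```

The Little's Law result is what sizes your serving fleet: 400 concurrent requests at a given batch size per GPU dictates the GPU count.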
## Training
| Name | Formula | Note |
|---|---|---|
| AdamW memory | 8 bytes/param (fp32 first + second moments) | 70B model: 560 GB of optimizer state alone. |
| Training memory total | weights + grads + optimizer + activations | ~16-20 bytes/param minimum. 70B needs ~1+ TB. |
| LoRA trainable params | (m + n) * r per adapted layer | r=16, 4096x4096 layer: 131K vs 16.7M. ~128x reduction. |
| QLoRA memory | 4-bit base + bf16 LoRA adapters | 70B on single 48 GB GPU. |
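
The LoRA parameter-count formula checks out numerically. A W of shape m x n is adapted by A (m x r) and B (r x n), so trainable params are (m + n) * r:

```python
def lora_params(m, n, r):
    """Trainable params for a LoRA adapter on an m x n weight: (m + n) * r."""
    return (m + n) * r

full = 4096 * 4096                  # 16,777,216 params in the frozen layer
lora = lora_params(4096, 4096, 16)  # 131,072 trainable params at r=16
reduction = full / lora             # 128x fewer trainable params
```

The reduction factor for a square layer simplifies to m / (2 * r), so halving r doubles the savings.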
## GPU & Compilation
| Name | Formula | Note |
|---|---|---|
| GPU theoretical FLOPS | SMs x MACs/SM/clock x 2 (FMA) x clock | H100 SXM bf16 tensor cores: 132 SMs x 2048 MACs x 2 x 1.83 GHz ≈ 989 TFLOPS. |
| Arithmetic intensity | FLOPs / bytes_loaded | Compute-bound if AI > machine balance. Decode AI~1, prefill AI~2*d_model. |
| HBM bandwidth utilization | bytes_moved / (time x peak_BW) x 100% | H100: 3.35 TB/s peak. Most kernels achieve 60-80%. |
| Occupancy | active_warps / max_warps_per_SM | Max 64 warps/SM on H100. Higher != always faster. |
| Fusion speedup | ~N x HBM_round_trips_saved | Fusing 3 elementwise ops saves 2 HBM reads + 2 writes. |
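
The peak-FLOPS and arithmetic-intensity rows combine into a simple roofline check. A sketch, assuming the H100 SXM figures above (132 SMs, 2048 bf16 tensor-core MACs/SM/clock, 1.83 GHz, 3.35 TB/s HBM):

```python
def peak_flops(sms, macs_per_sm_per_clock, clock_hz):
    """Peak throughput: SMs * MACs/SM/clock * 2 (an FMA is 2 FLOPs) * clock."""
    return sms * macs_per_sm_per_clock * 2 * clock_hz

def is_compute_bound(flops, bytes_moved, peak_flops_s, peak_bw_bytes_s):
    """Roofline test: compute-bound iff arithmetic intensity > machine balance."""
    return (flops / bytes_moved) > (peak_flops_s / peak_bw_bytes_s)

p = peak_flops(132, 2048, 1.83e9)   # ~9.89e14 = 989 TFLOPS bf16
balance = p / 3.35e12               # ~295 FLOPs/byte machine balance
```

Machine balance of ~295 FLOPs/byte is why decode (AI ~ 1) sits deep in the memory-bound region while prefill (AI ~ 2 * d_model, thousands) is compute-bound.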
## Retrieval
| Name | Formula | Note |
|---|---|---|
| BM25 score | sum(IDF(t) * (tf * (k1+1)) / (tf + k1*(1-b+b*dl/avgdl))) | k1~1.5, b~0.75. Saturating TF + length normalization. |
| PQ compression | M subquantizers * log2(K) bits per vector | M=32 subspaces, K=256 centroids = 32 bytes/vector. |
| RRF fusion | score = sum(1 / (k + rank_i)) | k=60. Rank-only, no score normalization needed. |
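
The RRF formula is short enough to implement directly. A minimal sketch fusing two ranked lists (e.g. BM25 and vector search), using ranks only, no score normalization:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d)."""
    scores = {}
    for ranking in rankings:                          # each list is best-first
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks high in both lists, so it wins despite never being first overall... 
fused = rrf([["a", "b", "c"], ["b", "c", "a"]])       # ["b", "a", "c"]
```

The constant k=60 damps the gap between rank 1 and rank 2, which is what makes RRF robust to one retriever's overconfidence.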