# Formula reference
Every formula from the book on one page. Pull this up during an interview.
## Attention & Transformer
| Name | Formula | Note |
|---|---|---|
| Attention scores | QK^T / sqrt(d_k) | Scaled dot-product. Dividing by sqrt(d_k) keeps score variance near 1 so softmax doesn't saturate. |
| Attention output | softmax(QK^T / sqrt(d_k)) V | Weighted sum of values. |
| Score matrix size | O(s^2) entries | s = sequence length. This is why FlashAttention matters. |
| Multi-head split | d_h = d_model / H | H heads, each sees d_h dimensions. Total compute unchanged. |
| Transformer FLOPs/token | ~2 * N (forward) or ~6 * N (fwd+bwd) | N = parameter count. Backward is ~2x forward. Rule of thumb. |
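
The attention formulas above fit in a few lines of NumPy. A minimal sketch (unbatched, single head, no masking):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (s, s) score matrix: the O(s^2) term
    # Row-wise softmax, shifted by the row max for numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                # weighted sum of values

s, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((s, d_k)) for _ in range(3))
out = attention(Q, K, V)              # shape (s, d_k) = (4, 8)
```

Note that `scores` materializes all s^2 entries, which is exactly the memory traffic FlashAttention avoids by tiling.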
## KV Cache
| Name | Formula | Note |
|---|---|---|
| KV cache per token | 2 * n_layers * n_kv_heads * d_head * bytes | 2 for K and V. Llama 3 70B = 320 KB/token in bf16. |
| KV cache total | per_token * seq_len * batch_size | Dominates HBM at any serious concurrency. |
| GQA reduction factor | n_q_heads / n_kv_heads | Llama 3 70B: 64/8 = 8x smaller KV cache vs MHA. |
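
The per-token and total formulas compose into a one-function calculator. The Llama 3 70B shapes (80 layers, 8 KV heads under GQA, head_dim 128) reproduce the 320 KB/token figure:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, bytes_per_el=2, seq_len=1, batch=1):
    """KV cache bytes: 2 (K and V) * layers * kv_heads * head_dim * dtype size."""
    per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_el
    return per_token * seq_len * batch

# Llama 3 70B in bf16 (2 bytes/element)
per_tok = kv_cache_bytes(80, 8, 128)                        # 327,680 B = 320 KB/token
full = kv_cache_bytes(80, 8, 128, seq_len=8192, batch=32)   # 80 GiB of HBM
```

At 8K context and batch 32 the cache alone is 80 GiB, which is why it "dominates HBM at any serious concurrency."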
## Inference Performance
| Name | Formula | Note |
|---|---|---|
| Prefill arithmetic intensity | FLOPs / bytes_loaded ~ 2 * d_model | Compute-bound. GPU does useful work. |
| Decode arithmetic intensity | ~1 (one token through all weights) | Memory-bandwidth-bound. GPU mostly waits for HBM. |
| TTFT | queue_delay + input_tokens / prefill_rate | Time to first token. User-perceived latency. |
| TPOT | 1 / decode_rate | Time per output token. ~12-25ms for 70B on H100. |
| End-to-end latency | TTFT + (output_tokens * TPOT) | Decode dominates for long outputs. |
| Model weight memory | params * bytes_per_param | 70B bf16 = 140 GB. INT4 = 35 GB. |
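
The TTFT / TPOT / end-to-end formulas chain together directly. A sketch with illustrative numbers (2K-token prompt, 10K tok/s prefill, 15 ms TPOT; these are assumptions, not benchmarks):

```python
def ttft_s(queue_delay_s, input_tokens, prefill_rate_tok_s):
    """Time to first token: queue delay plus prefill time."""
    return queue_delay_s + input_tokens / prefill_rate_tok_s

def e2e_latency_s(ttft, output_tokens, tpot_s):
    """End-to-end latency: TTFT + output_tokens * TPOT."""
    return ttft + output_tokens * tpot_s

t0 = ttft_s(0.05, 2048, 10_000)        # 0.2548 s to first token
total = e2e_latency_s(t0, 512, 0.015)  # decode adds 7.68 s on top
```

Even here, decode is ~30x the prefill time: long outputs make TPOT, not TTFT, the number to optimize.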
## Cost & Capacity
| Name | Formula | Note |
|---|---|---|
| Cost per 1M tokens | GPU_cost_per_hour / tokens_per_hour * 1M | Utilization is the lever that moves cost 2-5x. |
| Chinchilla optimal | ~20 tokens per parameter | Compute-optimal. Modern frontier over-trains 5-10x. |
| Little's Law | L = lambda * W | Concurrent requests = arrival rate * avg latency. |
| Compound availability | A_total = A1 * A2 * ... * An (serial) | Four 99.9% deps = 99.6% ceiling. |
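
Little's Law and compound availability are both one-liners worth sanity-checking with real numbers:

```python
import math

def concurrent_requests(arrival_rate_per_s, avg_latency_s):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * avg_latency_s

def serial_availability(*availabilities):
    """Serial dependencies multiply: A_total = A1 * A2 * ... * An."""
    return math.prod(availabilities)

# 50 req/s arriving, 8 s average end-to-end latency -> 400 requests in flight
L = concurrent_requests(50, 8.0)
# Four 99.9% dependencies in series -> ~99.6% ceiling
A = serial_availability(0.999, 0.999, 0.999, 0.999)
```

The Little's Law result is what sizes your serving fleet: 400 concurrent requests at a given batch size per GPU dictates the GPU count.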
## Training
| Name | Formula | Note |
|---|---|---|
| AdamW memory | 8 bytes/param (fp32 first + second moments) | 70B model: 560 GB of optimizer state alone. |
| Training memory total | weights + grads + optimizer + activations | ~16-20 bytes/param minimum. 70B needs ~1+ TB. |
| LoRA trainable params | (m + n) * r per adapted layer | r=16, 4096x4096 layer: 131K vs 16.7M. ~128x reduction. |
| QLoRA memory | 4-bit base + bf16 LoRA adapters | 70B on single 48 GB GPU. |
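
The LoRA parameter-count formula checks out numerically. A W of shape m x n is adapted by A (m x r) and B (r x n), so trainable params are (m + n) * r:

```python
def lora_params(m, n, r):
    """Trainable params for a LoRA adapter on an m x n weight: (m + n) * r."""
    return (m + n) * r

full = 4096 * 4096                  # 16,777,216 params in the frozen layer
lora = lora_params(4096, 4096, 16)  # 131,072 trainable params at r=16
reduction = full / lora             # 128x fewer trainable params
```

The reduction factor for a square layer simplifies to m / (2 * r), so halving r doubles the savings.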
## GPU & Compilation
| Name | Formula | Note |
|---|---|---|
| GPU theoretical FLOPS | SMs x MACs/SM/clock x 2 (FMA) x clock | H100 SXM bf16 tensor cores: 132 SMs x 2048 MACs x 2 x 1.83 GHz ≈ 989 TFLOPS. |
| Arithmetic intensity | FLOPs / bytes_loaded | Compute-bound if AI > machine balance. Decode AI~1, prefill AI~2*d_model. |
| HBM bandwidth utilization | bytes_moved / (time x peak_BW) x 100% | H100: 3.35 TB/s peak. Most kernels achieve 60-80%. |
| Occupancy | active_warps / max_warps_per_SM | Max 64 warps/SM on H100. Higher != always faster. |
| Fusion speedup | ~N x HBM_round_trips_saved | Fusing 3 elementwise ops saves 2 HBM reads + 2 writes. |
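
The peak-FLOPS and arithmetic-intensity rows combine into a simple roofline check. A sketch, assuming the H100 SXM figures above (132 SMs, 2048 bf16 tensor-core MACs/SM/clock, 1.83 GHz, 3.35 TB/s HBM):

```python
def peak_flops(sms, macs_per_sm_per_clock, clock_hz):
    """Peak throughput: SMs * MACs/SM/clock * 2 (an FMA is 2 FLOPs) * clock."""
    return sms * macs_per_sm_per_clock * 2 * clock_hz

def is_compute_bound(flops, bytes_moved, peak_flops_s, peak_bw_bytes_s):
    """Roofline test: compute-bound iff arithmetic intensity > machine balance."""
    return (flops / bytes_moved) > (peak_flops_s / peak_bw_bytes_s)

p = peak_flops(132, 2048, 1.83e9)   # ~9.89e14 = 989 TFLOPS bf16
balance = p / 3.35e12               # ~295 FLOPs/byte machine balance
```

Machine balance of ~295 FLOPs/byte is why decode (AI ~ 1) sits deep in the memory-bound region while prefill (AI ~ 2 * d_model, thousands) is compute-bound.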
## Retrieval
| Name | Formula | Note |
|---|---|---|
| BM25 score | sum(IDF(t) * (tf * (k1+1)) / (tf + k1*(1-b+b*dl/avgdl))) | k1~1.5, b~0.75. Saturating TF + length normalization. |
| PQ compression | M subquantizers * log2(K) bits per vector | M=32 subspaces, K=256 centroids = 32 bytes/vector. |
| RRF fusion | score = sum(1 / (k + rank_i)) | k=60. Rank-only, no score normalization needed. |
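
The RRF formula is short enough to implement directly. A minimal sketch fusing two ranked lists (e.g. BM25 and vector search), using ranks only, no score normalization:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d)."""
    scores = {}
    for ranking in rankings:                          # each list is best-first
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks high in both lists, so it wins despite never being first overall... 
fused = rrf([["a", "b", "c"], ["b", "c", "a"]])       # ["b", "a", "c"]
```

The constant k=60 damps the gap between rank 1 and rank 2, which is what makes RRF robust to one retriever's overconfidence.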