# Comparison tables

Every "X vs Y" comparison from the book in one place: the quick reference for interviews.
## Attention & Architecture

### MHA vs MQA vs GQA vs MLA (Ch. 33)

| | MHA | MQA | GQA | MLA |
|---|---|---|---|---|
| KV heads | H (= Q heads) | 1 | H/G (G heads per group) | 1 latent vector |
| KV cache size | 1× (baseline) | H× smaller | G× smaller | Smallest (latent dim) |
| Quality | Best | Slight drop | Near-MHA | Near-MHA |
| Used by | GPT-3, BERT | PaLM | Llama 3, Mistral | DeepSeek-V2 |
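The cache-size column can be made concrete with a little arithmetic. A minimal sketch, assuming hypothetical Llama-3-8B-like dimensions (32 query heads, head dim 128, 8 KV heads for GQA, a 512-dim latent for MLA, fp16 storage); the numbers are illustrative, not from the book:

```python
# KV-cache bytes per token per layer for each attention variant.
def kv_bytes_per_token(num_kv_heads, head_dim, dtype_bytes=2):
    # K and V are each (num_kv_heads, head_dim) per token per layer.
    return 2 * num_kv_heads * head_dim * dtype_bytes

H, D = 32, 128
mha = kv_bytes_per_token(H, D)   # full head count
mqa = kv_bytes_per_token(1, D)   # one shared KV head -> 32x smaller
gqa = kv_bytes_per_token(8, D)   # 8 KV heads -> 4x smaller
mla = 512 * 2                    # one compressed latent, no separate K/V
print(mha, mqa, gqa, mla)
```

The reduction factor is simply (query heads) / (KV heads), which is why GQA sits between MHA and MQA.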
### Encoder-only vs Decoder-only vs Encoder-Decoder (Ch. 7)

| | Encoder-only | Decoder-only | Encoder-Decoder |
|---|---|---|---|
| Mask | Bidirectional | Causal | Bidirectional + causal |
| Use case | Classification, embeddings | Generation, chat | Translation, summarization |
| Examples | BERT, RoBERTa | GPT, Llama, Mistral | T5, BART |
| Status (2025) | Niche (embeddings) | Dominant | Declining |
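The mask row is the whole architectural distinction, and it is easy to write out. A minimal sketch (1 = may attend, 0 = masked):

```python
# Bidirectional mask (encoder): every position attends to every other.
# Causal mask (decoder): position i attends only to positions <= i.
def causal_mask(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    return [[1] * n for _ in range(n)]

for row in causal_mask(4):
    print(row)  # lower-triangular pattern
```

An encoder-decoder model combines both: bidirectional over the source, causal over the target, plus cross-attention between them.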
## Training & Fine-tuning

### DDP vs FSDP (ZeRO) (Ch. 12)

| | DDP | FSDP / ZeRO-3 |
|---|---|---|
| Model replication | Full model on each GPU | Sharded across GPUs |
| Memory per GPU | Full model + optimizer | Shard only |
| Communication | All-reduce gradients | All-gather params + reduce-scatter grads |
| When to use | Model fits on 1 GPU | Model too large for 1 GPU |
| Throughput | Higher (less comm) | Lower (more comm, but scales) |
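The memory row follows from a standard back-of-envelope: with mixed-precision Adam, each parameter costs roughly 16 bytes (bf16 weights + grads, fp32 master weights + two optimizer moments). A rough sketch under that assumption:

```python
# Per-GPU memory estimate, assuming ~16 bytes/param for mixed-precision Adam
# (2 bf16 param + 2 bf16 grad + 4 fp32 master + 4 + 4 optimizer moments).
def ddp_gb(params_b):
    # DDP: every GPU replicates the full 16 bytes/param.
    return params_b * 16

def zero3_gb(params_b, world_size):
    # ZeRO-3/FSDP: params, grads, and optimizer state all sharded.
    return params_b * 16 / world_size

print(f"7B model, DDP:            {ddp_gb(7):.0f} GB/GPU")
print(f"7B model, ZeRO-3 x 8 GPU: {zero3_gb(7, 8):.0f} GB/GPU")
```

This ignores activations and fragmentation, but it shows why a model that overflows one GPU under DDP can fit comfortably once sharded.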
### LoRA vs QLoRA vs Full fine-tuning (Ch. 15)

| | Full FT | LoRA | QLoRA |
|---|---|---|---|
| Trainable params | All | ~0.1-1% | ~0.1-1% |
| Base model | bf16 | bf16 (frozen) | 4-bit NF4 (frozen) |
| Memory (70B) | ~1 TB+ | ~153 GB | ~48 GB |
| GPUs needed | 64+ H100s | 4-8 H100s | 1 A100-80GB |
| Quality | Best | ~95-99% of full | ~93-97% of full |
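The ~0.1-1% trainable-parameter figure comes straight from the low-rank factorization: a d_out x d_in weight gets adapters A (r x d_in) and B (d_out x r), so the adapter adds r*(d_in + d_out) parameters instead of d_in*d_out. A quick check with an illustrative hidden size (not a figure from the book):

```python
# LoRA adapter size vs the full weight it adapts.
def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

d = 4096                       # hypothetical hidden size
full = d * d                   # one full projection matrix
adapter = lora_params(d, d, r=16)
print(f"full: {full:,}  adapter: {adapter:,}  "
      f"fraction: {adapter / full:.2%}")
```

At rank 16 the adapter is under 1% of the projection it modifies, which is why LoRA and QLoRA share the same trainable-params row.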
### RLHF vs DPO vs KTO (Ch. 17)

| | RLHF (PPO) | DPO | KTO |
|---|---|---|---|
| Models needed | 4 (policy, ref, reward, value) | 2 (policy, ref) | 2 (policy, ref) |
| Data format | Prompts + reward signal | Preference pairs | Binary thumbs up/down |
| Complexity | High (PPO tuning) | Low (one loss) | Lowest |
| Data requirement | Moderate | Paired comparisons | Unpaired signals |
| Default choice | No (unless online RL needed) | Yes | When pairs unavailable |
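The "one loss" that makes DPO low-complexity is a logistic loss on the margin between policy and reference log-prob gaps for the chosen vs rejected response. A minimal sketch with made-up log-probs:

```python
import math

# DPO loss for one preference pair: -log sigmoid(beta * margin), where
# the margin compares how much more the policy prefers the chosen answer
# than the frozen reference does.
def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))

# Policy favors the chosen answer more than the reference -> modest loss.
print(round(dpo_loss(-10.0, -14.0, -12.0, -13.0), 4))
```

No reward model, no value head, no PPO rollout loop: just this loss over a dataset of pairs, which is the entire "2 models needed" row.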
## Inference

### Prefill vs Decode (Ch. 21)

| | Prefill | Decode |
|---|---|---|
| Input | Full prompt (many tokens) | 1 token per step |
| Bound by | Compute (high arithmetic intensity) | Memory bandwidth (arithmetic intensity ≈ 1) |
| Parallelism | Fully parallel across tokens | Sequential (autoregressive) |
| Latency metric | TTFT (time to first token) | TPOT (time per output token) |
| GPU utilization | High | Low (weight-loading dominated) |
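The arithmetic-intensity row can be derived for a single weight matrix: FLOPs scale with the number of tokens in the batch, but the weights are read from memory once either way. A rough sketch assuming fp16 weights:

```python
# Arithmetic intensity (FLOPs per byte of weights read) for one matmul.
def arithmetic_intensity(batch_tokens, d_in, d_out, dtype_bytes=2):
    flops = 2 * batch_tokens * d_in * d_out      # multiply-accumulate
    weight_bytes = d_in * d_out * dtype_bytes    # weights streamed once
    return flops / weight_bytes

d = 4096
print(f"prefill (2048 tokens): AI ~ {arithmetic_intensity(2048, d, d):.0f}")
print(f"decode  (1 token):     AI ~ {arithmetic_intensity(1, d, d):.0f}")
```

Prefill amortizes each weight load over thousands of tokens (compute-bound); decode gets one token per load (bandwidth-bound), which is exactly the utilization gap in the last row.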
### Serving frameworks (Ch. 42)

| | vLLM | SGLang | TRT-LLM | TGI | llama.cpp |
|---|---|---|---|---|---|
| Language | Python + CUDA | Python + CUDA | C++ + CUDA | Rust + Python | C/C++ |
| Key feature | PagedAttention | RadixAttention | FP8 + inflight batching | HF ecosystem | CPU/Apple Silicon |
| Production use | High | Growing | High (NVIDIA) | Moderate | Edge/local |
| Multi-GPU | TP + PP | TP + PP | TP + PP + EP | TP | Limited |
| Best for | General GPU serving | Prefix-heavy workloads | Max throughput NVIDIA | HF models | Laptops/edge |
## Retrieval & RAG

### Bi-encoder vs Cross-encoder (Ch. 9)

| | Bi-encoder | Cross-encoder |
|---|---|---|
| Encoding | Query and doc independently | Query + doc jointly |
| Precomputation | Yes (doc embeddings cached) | No (forward pass per pair) |
| Speed | Fast (ANN lookup) | Slow (one pair at a time) |
| Quality | Good | Best |
| Use case | First-stage retrieval (top-100) | Reranking (top-100 to top-10) |
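The two rows compose into the standard retrieve-then-rerank pipeline: the cheap bi-encoder narrows the corpus, the expensive cross-encoder reorders the survivors. A toy sketch where both scorers are word-overlap stand-ins (not real models):

```python
# Stand-in for the dot product of precomputed embeddings (cheap).
def bi_encoder_score(query, doc):
    q, d = set(query.split()), set(doc.split())
    return len(q & d)

# Stand-in for a joint forward pass over (query, doc) (expensive).
def cross_encoder_score(query, doc):
    return bi_encoder_score(query, doc) + (query.split()[0] in doc)

def search(query, corpus, k_retrieve=3, k_final=2):
    # Stage 1: bi-encoder over the whole corpus (would be an ANN lookup).
    candidates = sorted(corpus, key=lambda d: bi_encoder_score(query, d),
                        reverse=True)[:k_retrieve]
    # Stage 2: cross-encoder only over the shortlist.
    return sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:k_final]

corpus = ["cats sit on mats", "dogs chase cats", "rain falls softly",
          "cats chase mice"]
print(search("cats chase", corpus))
```

The shape is the point: the cross-encoder runs k_retrieve forward passes, not one per corpus document, which is what makes "best quality" affordable.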
### BM25 vs Dense vs Hybrid (Ch. 58)

| | BM25 | Dense | Hybrid (RRF) |
|---|---|---|---|
| Signal | Exact term match | Semantic similarity | Both |
| Out-of-domain | Strong | Weak (needs fine-tuning) | Strong |
| Semantic understanding | None | Strong | Strong |
| Infrastructure | Inverted index | Vector DB + embeddings | Both |
| Quality (in-domain) | Baseline | Better | Best (+5-15 nDCG@10) |
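The fusion step behind the "Hybrid (RRF)" column is small enough to write in full: each ranker contributes 1 / (k + rank) per document, and documents are sorted by summed score. k = 60 is the conventional constant; the doc IDs below are made up:

```python
# Reciprocal Rank Fusion over any number of rankings.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d3", "d2"]    # lexical ranking
dense_ranking = ["d2", "d1", "d4"]   # semantic ranking
print(rrf([bm25_ranking, dense_ranking]))
```

Because RRF uses only ranks, it needs no score normalization between the incompatible BM25 and cosine-similarity scales, which is why it is the default hybrid combiner.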
## GPU & Compilation

### torch.compile vs ONNX Runtime vs TensorRT vs XLA (Ch. 130)

| | torch.compile | ONNX Runtime | TensorRT | XLA/JAX |
|---|---|---|---|---|
| Approach | Python-level tracing + Triton codegen | Graph-level optimization | Layer fusion + precision calibration | Whole-program HLO compilation |
| Dynamic shapes | Native | Native | Profiles (limited) | Recompile per shape |
| Setup cost | One line | Export to ONNX | Build engine (minutes) | JIT by default (`jax.jit`) |
| Hardware | NVIDIA + some CPU | CPU, CUDA, TensorRT, DirectML | NVIDIA only | TPU, GPU, CPU |
| Best for | Quick wins, custom ops | Cross-platform deploy | Max NVIDIA latency | Research, TPU workloads |
| LLM support | Yes (via vLLM/SGLang) | Limited for autoregressive | TRT-LLM (specialized) | Limited |
### GPU memory hierarchy (Ch. 129)

| | Registers | Shared / L1 | L2 Cache | HBM |
|---|---|---|---|---|
| Capacity (H100) | ~256 KB/SM | 228 KB/SM | 50 MB | 80 GB |
| Bandwidth | ~40 TB/s | ~19 TB/s | ~12 TB/s | 3.35 TB/s |
| Latency | ~0 cycles | ~20 cycles | ~200 cycles | ~400 cycles |
| Managed by | Compiler | Programmer | Hardware | Hardware |
| ML relevance | Accumulators, indices | FlashAttention tiles | Weight reuse | Model weights, KV cache |
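The HBM row sets a hard ceiling on decode speed: every generated token must stream all model weights from HBM. A back-of-envelope sketch using the table's 3.35 TB/s figure and a hypothetical 7B-parameter model in fp16:

```python
# Bandwidth-bound upper limit on single-stream decode throughput.
def max_decode_tok_per_s(params_b, dtype_bytes, hbm_tb_per_s):
    bytes_per_token = params_b * 1e9 * dtype_bytes   # all weights, once
    return hbm_tb_per_s * 1e12 / bytes_per_token

print(f"{max_decode_tok_per_s(7, 2, 3.35):.0f} tok/s ceiling")
```

Real systems land below this (attention, KV-cache reads, kernel overhead), and batching raises effective throughput by amortizing the same weight traffic over many sequences.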
## Infrastructure & DevOps

### Helm vs Kustomize vs CDK8s (Ch. 106)

| | Helm | Kustomize | CDK8s |
|---|---|---|---|
| Approach | Go templates | Overlay patching | Imperative code (TS/Py) |
| Readability | Low (template syntax) | High (plain YAML) | Medium (code) |
| Power | High | Medium | Highest |
| Ecosystem | Largest (charts) | Built into kubectl | Small |
| Best for | Third-party apps | Simple overlays | Complex generated manifests |
### Temporal vs Airflow vs Step Functions (Ch. 78)

| | Temporal | Airflow | Step Functions |
|---|---|---|---|
| Model | Durable execution | DAG scheduler | State machine (JSON) |
| Long-running | Yes (months) | No (task-level) | Yes (1 year max) |
| Language | Go/Java/Python/TS | Python | JSON/YAML |
| Self-hosted | Yes | Yes | No (AWS only) |
| Best for | Microservice orchestration | Data pipelines | AWS-native workflows |