Comparison tables

Every "X vs Y" from the book in one place. The quick-reference for interviews.

Attention & Architecture

**MHA vs MQA vs GQA vs MLA** (Ch 33)

|  | MHA | MQA | GQA | MLA |
|---|---|---|---|---|
| KV heads | H (= Q heads) | 1 | H/G groups | 1 latent |
| KV cache size | 1× (baseline) | H× smaller | G× smaller | Smallest (latent dim) |
| Quality | Best | Slight drop | Near-MHA | Near-MHA |
| Used by | GPT-3, BERT | PaLM | Llama 3, Mistral | DeepSeek-V2 |
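The cache-size row follows directly from the head counts. A minimal sketch, using illustrative Llama-3-70B-like shapes (80 layers, 64 query heads, head dim 128; the function name and numbers are assumptions for illustration):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
    per token, times tokens, at bf16 (2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# 8k-token context, batch 1
mha = kv_cache_gb(80, 64, 128, seq_len=8192, batch=1)  # full MHA: 64 KV heads
gqa = kv_cache_gb(80, 8, 128, seq_len=8192, batch=1)   # GQA: 8 KV-head groups
mqa = kv_cache_gb(80, 1, 128, seq_len=8192, batch=1)   # MQA: 1 shared KV head
print(f"MHA {mha:.1f} GB, GQA {gqa:.1f} GB, MQA {mqa:.2f} GB")
```

With 64 query heads grouped into 8 KV heads, GQA's cache is exactly 8× smaller than MHA's, which is why larger batches fit on the same GPU.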
**Encoder-only vs Decoder-only vs Encoder-Decoder** (Ch 7)

|  | Encoder-only | Decoder-only | Encoder-Decoder |
|---|---|---|---|
| Mask | Bidirectional | Causal | Bidirectional + causal |
| Use case | Classification, embeddings | Generation, chat | Translation, summarization |
| Examples | BERT, RoBERTa | GPT, Llama, Mistral | T5, BART |
| Status (2025) | Niche (embeddings) | Dominant | Declining |
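The mask row is the whole architectural difference. A minimal sketch (plain Python, no framework assumed):

```python
def causal_mask(T):
    """Decoder mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(T)] for i in range(T)]

def bidirectional_mask(T):
    """Encoder mask: every position attends to every position."""
    return [[True] * T for _ in range(T)]

mask = causal_mask(4)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
# Row i has i+1 ones: token 0 sees only itself, token 3 sees all four.
```

An encoder-decoder model uses the bidirectional mask over the source sequence and the causal mask over the target, plus cross-attention between them.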

Training & Fine-tuning

**DDP vs FSDP (ZeRO)** (Ch 12)

|  | DDP | FSDP / ZeRO-3 |
|---|---|---|
| Model replication | Full model on each GPU | Sharded across GPUs |
| Memory per GPU | Full model + optimizer | Shard only |
| Communication | All-reduce gradients | All-gather params + reduce-scatter grads |
| When to use | Model fits on 1 GPU | Model too large for 1 GPU |
| Throughput | Higher (less communication) | Lower (more communication, but scales) |
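The "memory per GPU" row can be made concrete with the standard 16-bytes-per-parameter accounting for mixed-precision Adam (a rough sketch; activations and FSDP's transient all-gathered layer params are ignored, and the function name is illustrative):

```python
def per_gpu_gb(n_params, n_gpus, sharded):
    """Training-state memory: bf16 weights (2) + bf16 grads (2)
    + fp32 master weights, Adam momentum, Adam variance (4+4+4)
    = 16 bytes per parameter."""
    total = n_params * 16
    return (total / n_gpus if sharded else total) / 1e9

P = 7e9  # a 7B-parameter model
print(f"DDP:    {per_gpu_gb(P, 8, sharded=False):.0f} GB per GPU")
print(f"ZeRO-3: {per_gpu_gb(P, 8, sharded=True):.0f} GB per GPU")
```

Even a 7B model's training state (112 GB) overflows a single 80 GB GPU under DDP, while ZeRO-3 sharding across 8 GPUs brings it to 14 GB each. This is the "model fits on 1 GPU" decision rule in numbers.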
**LoRA vs QLoRA vs Full fine-tuning** (Ch 15)

|  | Full FT | LoRA | QLoRA |
|---|---|---|---|
| Trainable params | All | ~0.1-1% | ~0.1-1% |
| Base model | bf16 | bf16 (frozen) | 4-bit NF4 (frozen) |
| Memory (70B) | ~1 TB+ | ~153 GB | ~48 GB |
| GPUs needed | 64+ H100s | 4-8 H100s | 1× A100 80 GB |
| Quality | Best | ~95-99% of full | ~93-97% of full |
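The ~0.1-1% figure follows from the low-rank factorization itself. A minimal sketch with illustrative shapes (8192-dim model, rank-16 adapters; the helper name is an assumption):

```python
def lora_trainable_fraction(d_model, n_linear, rank):
    """Each adapted d×d linear gains factors A (r×d) and B (d×r);
    only these 2*d*r params train, the d×d base weight stays frozen."""
    base = n_linear * d_model * d_model
    lora = n_linear * 2 * d_model * rank
    return lora / base

frac = lora_trainable_fraction(d_model=8192, n_linear=4 * 80, rank=16)
print(f"trainable fraction: {frac:.2%}")
```

The fraction reduces to 2r/d regardless of how many matrices are adapted, so rank 16 on an 8192-dim model lands squarely in the table's ~0.1-1% band.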
**RLHF vs DPO vs KTO** (Ch 17)

|  | RLHF (PPO) | DPO | KTO |
|---|---|---|---|
| Models needed | 4 (policy, ref, reward, value) | 2 (policy, ref) | 2 (policy, ref) |
| Data format | Prompts + reward signal | Preference pairs | Binary thumbs up/down |
| Complexity | High (PPO tuning) | Low (single loss) | Lowest |
| Data requirement | Moderate | Paired comparisons | Unpaired signals |
| Default choice? | No (unless online RL needed) | Yes | When pairs are unavailable |
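DPO's "low complexity" is literal: the whole method is one closed-form loss over sequence log-probabilities from the policy and frozen reference. A minimal sketch (the log-prob values are made-up illustrations):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * margin), where the margin is the
    policy's chosen-vs-rejected log-ratio measured against the reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Sequence log-probs for one preference pair (illustrative numbers)
loss = dpo_loss(pi_chosen=-12.0, pi_rejected=-15.0,
                ref_chosen=-13.0, ref_rejected=-14.0)
print(f"loss = {loss:.4f}")
```

A positive margin (policy already prefers the chosen response more than the reference does) drives the loss below log 2; no reward model, value head, or PPO rollout loop appears anywhere.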

Inference

**Prefill vs Decode** (Ch 21)

|  | Prefill | Decode |
|---|---|---|
| Input | Full prompt (many tokens) | 1 token per step |
| Bound by | Compute (high arithmetic intensity) | Memory bandwidth (AI ≈ 1) |
| Parallelism | Fully parallel across tokens | Sequential (autoregressive) |
| Latency metric | TTFT (time to first token) | TPOT (time per output token) |
| GPU utilization | High | Low (dominated by weight loading) |
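The arithmetic-intensity row comes from a back-of-envelope matmul count. A rough sketch for a single d×d weight matrix (activation traffic ignored; function name and shapes are illustrative):

```python
def arithmetic_intensity(n_tokens, d_model, bytes_per_weight=2):
    """FLOPs per byte for one d×d weight applied to n_tokens tokens:
    2*T*d^2 FLOPs against d^2 weight elements loaded once (bf16)."""
    flops = 2 * n_tokens * d_model ** 2
    bytes_moved = d_model ** 2 * bytes_per_weight
    return flops / bytes_moved

print(f"decode  (1 token):    AI ~ {arithmetic_intensity(1, 4096):.0f}")
print(f"prefill (2k tokens):  AI ~ {arithmetic_intensity(2048, 4096):.0f}")
```

An H100's compute-to-bandwidth ratio is roughly 300 FLOPs/byte (≈989 bf16 TFLOPs over 3.35 TB/s), so decode at AI ≈ 1 sits deep in the bandwidth-bound regime while a 2k-token prefill is comfortably compute-bound.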
**Serving frameworks** (Ch 42)

|  | vLLM | SGLang | TensorRT-LLM | TGI | llama.cpp |
|---|---|---|---|---|---|
| Language | Python + CUDA | Python + CUDA | C++ + CUDA | Rust + Python | C/C++ |
| Key feature | PagedAttention | RadixAttention | FP8 + in-flight batching | HF ecosystem | CPU/Apple Silicon |
| Production use | High | Growing | High (NVIDIA) | Moderate | Edge/local |
| Multi-GPU | TP + PP | TP + PP | TP + PP + EP | TP | Limited |
| Best for | General GPU serving | Prefix-heavy workloads | Max throughput on NVIDIA | HF models | Laptops/edge |
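vLLM's headline feature, PagedAttention, manages KV memory in fixed-size blocks with a per-sequence block table, so memory is allocated on demand instead of reserved for the max sequence length. A toy allocator sketching the idea (hypothetical class, not vLLM's actual API):

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator: KV memory is a pool of
    fixed-size blocks; each sequence maps its logical blocks to
    physical ones and grows one block at a time."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Finished sequences return their blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):  # 40 tokens need ceil(40/16) = 3 blocks
    cache.append_token("req-0")
print(len(cache.tables["req-0"]), "blocks in use,", len(cache.free), "free")
cache.release("req-0")
print(len(cache.free), "free after release")
```

The point is the indirection: attention kernels look up physical blocks through the table, so fragmentation-free reuse across requests comes almost for free.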

Retrieval & RAG

**Bi-encoder vs Cross-encoder** (Ch 9)

|  | Bi-encoder | Cross-encoder |
|---|---|---|
| Encoding | Query and doc independently | Query + doc jointly |
| Precomputation | Yes (doc embeddings cached) | No (one forward pass per pair) |
| Speed | Fast (ANN lookup) | Slow (scores one pair at a time) |
| Quality | Good | Best |
| Use case | First-stage retrieval (top-100) | Reranking (top-100 to top-10) |
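The standard pipeline composes the two. A minimal sketch with toy 3-d vectors and a stubbed scorer (all vectors, scores, and document ids are made up; a real cross-encoder runs one transformer forward pass per pair):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Bi-encoder stage: document vectors were embedded once, offline
doc_vecs = {"d1": [0.9, 0.1, 0.0], "d2": [0.2, 0.8, 0.1], "d3": [0.1, 0.2, 0.9]}
query_vec = [0.85, 0.2, 0.05]  # only the query is embedded at request time

# Cheap similarity against every cached vector, keep the top-2
candidates = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]),
                    reverse=True)[:2]

# Cross-encoder stage: expensive joint scoring of each (query, doc) pair
def cross_encoder_score(query, doc_id):
    return {"d1": 0.97, "d2": 0.40, "d3": 0.10}[doc_id]  # illustrative scores

reranked = sorted(candidates, key=lambda d: cross_encoder_score("q", d),
                  reverse=True)
print(reranked)
```

The bi-encoder prunes the corpus cheaply; the cross-encoder spends its per-pair forward passes only on the survivors.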
**BM25 vs Dense vs Hybrid** (Ch 58)

|  | BM25 | Dense | Hybrid (RRF) |
|---|---|---|---|
| Signal | Exact term match | Semantic similarity | Both |
| Out-of-domain | Strong | Weak (needs fine-tuning) | Strong |
| Semantic understanding | None | Strong | Strong |
| Infrastructure | Inverted index | Vector DB + embeddings | Both |
| Quality (in-domain) | Baseline | Better | Best (+5-15 nDCG@10) |
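Reciprocal Rank Fusion, the standard way to combine the two ranked lists, needs only ranks, not comparable scores. A minimal sketch (document ids are illustrative; k = 60 is the conventional constant):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1/(k + rank_d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d7", "d2", "d9"]   # exact-term view
dense_ranking = ["d2", "d5", "d7"]  # semantic view
fused = rrf([bm25_ranking, dense_ranking])
print(fused)
```

Note how d2, ranked second and first, edges out d7, ranked first and third: documents that both retrievers like float to the top, which is exactly the hybrid robustness the table claims.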

GPU & Compilation

**torch.compile vs ONNX Runtime vs TensorRT vs XLA** (Ch 130)

|  | torch.compile | ONNX Runtime | TensorRT | XLA/JAX |
|---|---|---|---|---|
| Approach | Python-level tracing + Triton codegen | Graph-level optimization | Layer fusion + precision calibration | Whole-program HLO compilation |
| Dynamic shapes | Native | Native | Profiles (limited) | Recompile per shape |
| Setup cost | One line | Export to ONNX | Build engine (minutes) | JIT by default |
| Hardware | NVIDIA + some CPU | CPU, CUDA, TensorRT, DirectML | NVIDIA only | TPU, GPU, CPU |
| Best for | Quick wins, custom ops | Cross-platform deployment | Lowest latency on NVIDIA | Research, TPU workloads |
| LLM support | Yes (via vLLM/SGLang) | Limited for autoregressive decoding | TRT-LLM (specialized) | Limited |
**GPU memory hierarchy** (Ch 129)

|  | Registers | Shared / L1 | L2 cache | HBM |
|---|---|---|---|---|
| Capacity (H100) | ~256 KB/SM | 228 KB/SM | 50 MB | 80 GB |
| Bandwidth | ~40 TB/s | ~19 TB/s | ~12 TB/s | 3.35 TB/s |
| Latency | ~0 cycles | ~20 cycles | ~200 cycles | ~400 cycles |
| Managed by | Compiler | Programmer | Hardware | Hardware |
| ML relevance | Accumulators, indices | FlashAttention tiles | Weight reuse | Model weights, KV cache |
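The "FlashAttention tiles" cell is a capacity constraint you can check by hand. A rough sketch, assuming the kernel keeps bf16 Q, K, V tiles plus an output accumulator tile in shared memory (real implementations differ in detail, e.g. fp32 softmax statistics):

```python
def tile_bytes(tile_rows, head_dim, bytes_per=2):
    """Rough SRAM footprint of one tile set: Q, K, V tiles plus the
    running output tile, each (tile_rows × head_dim) at bf16."""
    return 4 * tile_rows * head_dim * bytes_per

smem = 228 * 1024  # H100 shared memory per SM, from the table above
for rows in (64, 128, 256):
    b = tile_bytes(rows, head_dim=128)
    print(f"{rows:>3}-row tiles: {b // 1024:>3} KB -> "
          f"{'fits' if b <= smem else 'too big'}")
```

This is why tile sizes land around 64-128 rows: the whole working set must stay in the 228 KB of programmer-managed SRAM so the O(T²) attention matrix never round-trips through HBM.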

Infrastructure & DevOps

**Helm vs Kustomize vs CDK8s** (Ch 106)

|  | Helm | Kustomize | CDK8s |
|---|---|---|---|
| Approach | Go templates | Overlay patching | Imperative code (TS/Python) |
| Readability | Low (template syntax) | High (plain YAML) | Medium (code) |
| Power | High | Medium | Highest |
| Ecosystem | Largest (charts) | Built into kubectl | Small |
| Best for | Third-party apps | Simple overlays | Complex generated manifests |
**Temporal vs Airflow vs Step Functions** (Ch 78)

|  | Temporal | Airflow | Step Functions |
|---|---|---|---|
| Model | Durable execution | DAG scheduler | State machine (JSON) |
| Long-running | Yes (months) | No (task-level) | Yes (1-year max) |
| Language | Go/Java/Python/TS | Python | JSON/YAML |
| Self-hosted | Yes | Yes | No (AWS only) |
| Best for | Microservice orchestration | Data pipelines | AWS-native workflows |