# Comparison tables

Every "X vs Y" comparison from the book in one place: the quick reference for interviews.
## Attention & Architecture

### MHA vs MQA vs GQA vs MLA (Ch. 33)

| | MHA | MQA | GQA | MLA |
|---|---|---|---|---|
| KV heads | H (= Q heads) | 1 | H/G (G heads per group) | 1 latent vector |
| KV cache size | 1× (baseline) | H× smaller | G× smaller | Smallest (latent dim) |
| Quality | Best | Slight drop | Near-MHA | Near-MHA |
| Used by | GPT-3, BERT | PaLM | Llama 3, Mistral | DeepSeek-V2 |
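The cache-size column can be made concrete with a little arithmetic. A minimal sketch, assuming hypothetical Llama-3-8B-like dimensions (32 query heads, head dim 128, 8 KV heads for GQA, a 512-dim latent for MLA, fp16 storage); the numbers are illustrative, not from the book:

```python
# KV-cache bytes per token per layer for each attention variant.
def kv_bytes_per_token(num_kv_heads, head_dim, dtype_bytes=2):
    # K and V are each (num_kv_heads, head_dim) per token per layer.
    return 2 * num_kv_heads * head_dim * dtype_bytes

H, D = 32, 128
mha = kv_bytes_per_token(H, D)   # full head count
mqa = kv_bytes_per_token(1, D)   # one shared KV head -> 32x smaller
gqa = kv_bytes_per_token(8, D)   # 8 KV heads -> 4x smaller
mla = 512 * 2                    # one compressed latent, no separate K/V
print(mha, mqa, gqa, mla)
```

The reduction factor is simply (query heads) / (KV heads), which is why GQA sits between MHA and MQA.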
### Encoder-only vs Decoder-only vs Encoder-Decoder (Ch. 7)

| | Encoder-only | Decoder-only | Encoder-Decoder |
|---|---|---|---|
| Mask | Bidirectional | Causal | Bidirectional + causal |
| Use case | Classification, embeddings | Generation, chat | Translation, summarization |
| Examples | BERT, RoBERTa | GPT, Llama, Mistral | T5, BART |
| Status (2025) | Niche (embeddings) | Dominant | Declining |
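The mask row is the whole architectural distinction, and it is easy to write out. A minimal sketch (1 = may attend, 0 = masked):

```python
# Bidirectional mask (encoder): every position attends to every other.
# Causal mask (decoder): position i attends only to positions <= i.
def causal_mask(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    return [[1] * n for _ in range(n)]

for row in causal_mask(4):
    print(row)  # lower-triangular pattern
```

An encoder-decoder model combines both: bidirectional over the source, causal over the target, plus cross-attention between them.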
## Training & Fine-tuning

### DDP vs FSDP (ZeRO) (Ch. 12)

| | DDP | FSDP / ZeRO-3 |
|---|---|---|
| Model replication | Full model on each GPU | Sharded across GPUs |
| Memory per GPU | Full model + optimizer | Shard only |
| Communication | All-reduce gradients | All-gather params + reduce-scatter grads |
| When to use | Model fits on 1 GPU | Model too large for 1 GPU |
| Throughput | Higher (less comm) | Lower (more comm, but scales) |
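The memory row follows from a standard back-of-envelope: with mixed-precision Adam, each parameter costs roughly 16 bytes (bf16 weights + grads, fp32 master weights + two optimizer moments). A rough sketch under that assumption:

```python
# Per-GPU memory estimate, assuming ~16 bytes/param for mixed-precision Adam
# (2 bf16 param + 2 bf16 grad + 4 fp32 master + 4 + 4 optimizer moments).
def ddp_gb(params_b):
    # DDP: every GPU replicates the full 16 bytes/param.
    return params_b * 16

def zero3_gb(params_b, world_size):
    # ZeRO-3/FSDP: params, grads, and optimizer state all sharded.
    return params_b * 16 / world_size

print(f"7B model, DDP:            {ddp_gb(7):.0f} GB/GPU")
print(f"7B model, ZeRO-3 x 8 GPU: {zero3_gb(7, 8):.0f} GB/GPU")
```

This ignores activations and fragmentation, but it shows why a model that overflows one GPU under DDP can fit comfortably once sharded.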
### LoRA vs QLoRA vs Full fine-tuning (Ch. 15)

| | Full FT | LoRA | QLoRA |
|---|---|---|---|
| Trainable params | All | ~0.1-1% | ~0.1-1% |
| Base model | bf16 | bf16 (frozen) | 4-bit NF4 (frozen) |
| Memory (70B) | ~1 TB+ | ~153 GB | ~48 GB |
| GPUs needed | 64+ H100s | 4-8 H100s | 1 A100-80GB |
| Quality | Best | ~95-99% of full | ~93-97% of full |
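The ~0.1-1% trainable-parameter figure comes straight from the low-rank factorization: a d_out x d_in weight gets adapters A (r x d_in) and B (d_out x r), so the adapter adds r*(d_in + d_out) parameters instead of d_in*d_out. A quick check with an illustrative hidden size (not a figure from the book):

```python
# LoRA adapter size vs the full weight it adapts.
def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

d = 4096                       # hypothetical hidden size
full = d * d                   # one full projection matrix
adapter = lora_params(d, d, r=16)
print(f"full: {full:,}  adapter: {adapter:,}  "
      f"fraction: {adapter / full:.2%}")
```

At rank 16 the adapter is under 1% of the projection it modifies, which is why LoRA and QLoRA share the same trainable-params row.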
### RLHF vs DPO vs KTO (Ch. 17)

| | RLHF (PPO) | DPO | KTO |
|---|---|---|---|
| Models needed | 4 (policy, ref, reward, value) | 2 (policy, ref) | 2 (policy, ref) |
| Data format | Prompts + reward signal | Preference pairs | Binary thumbs up/down |
| Complexity | High (PPO tuning) | Low (one loss) | Lowest |
| Data requirement | Moderate | Paired comparisons | Unpaired signals |
| Default choice | No (unless online RL needed) | Yes | When pairs unavailable |
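The "one loss" that makes DPO low-complexity is a logistic loss on the margin between policy and reference log-prob gaps for the chosen vs rejected response. A minimal sketch with made-up log-probs:

```python
import math

# DPO loss for one preference pair: -log sigmoid(beta * margin), where
# the margin compares how much more the policy prefers the chosen answer
# than the frozen reference does.
def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))

# Policy favors the chosen answer more than the reference -> modest loss.
print(round(dpo_loss(-10.0, -14.0, -12.0, -13.0), 4))
```

No reward model, no value head, no PPO rollout loop: just this loss over a dataset of pairs, which is the entire "2 models needed" row.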
## Inference

### Prefill vs Decode (Ch. 21)

| | Prefill | Decode |
|---|---|---|
| Input | Full prompt (many tokens) | 1 token per step |
| Bound by | Compute (high arithmetic intensity) | Memory bandwidth (arithmetic intensity ≈ 1) |
| Parallelism | Fully parallel across tokens | Sequential (autoregressive) |
| Latency metric | TTFT (time to first token) | TPOT (time per output token) |
| GPU utilization | High | Low (weight-loading dominated) |
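The arithmetic-intensity row can be derived for a single weight matrix: FLOPs scale with the number of tokens in the batch, but the weights are read from memory once either way. A rough sketch assuming fp16 weights:

```python
# Arithmetic intensity (FLOPs per byte of weights read) for one matmul.
def arithmetic_intensity(batch_tokens, d_in, d_out, dtype_bytes=2):
    flops = 2 * batch_tokens * d_in * d_out      # multiply-accumulate
    weight_bytes = d_in * d_out * dtype_bytes    # weights streamed once
    return flops / weight_bytes

d = 4096
print(f"prefill (2048 tokens): AI ~ {arithmetic_intensity(2048, d, d):.0f}")
print(f"decode  (1 token):     AI ~ {arithmetic_intensity(1, d, d):.0f}")
```

Prefill amortizes each weight load over thousands of tokens (compute-bound); decode gets one token per load (bandwidth-bound), which is exactly the utilization gap in the last row.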
### Serving frameworks (Ch. 42)

| | vLLM | SGLang | TRT-LLM | TGI | llama.cpp |
|---|---|---|---|---|---|
| Language | Python + CUDA | Python + CUDA | C++ + CUDA | Rust + Python | C/C++ |
| Key feature | PagedAttention | RadixAttention | FP8 + inflight batching | HF ecosystem | CPU/Apple Silicon |
| Production use | High | Growing | High (NVIDIA) | Moderate | Edge/local |
| Multi-GPU | TP + PP | TP + PP | TP + PP + EP | TP | Limited |
| Best for | General GPU serving | Prefix-heavy workloads | Max throughput NVIDIA | HF models | Laptops/edge |
## Retrieval & RAG

### Bi-encoder vs Cross-encoder (Ch. 9)

| | Bi-encoder | Cross-encoder |
|---|---|---|
| Encoding | Query and doc independently | Query + doc jointly |
| Precomputation | Yes (doc embeddings cached) | No (forward pass per pair) |
| Speed | Fast (ANN lookup) | Slow (one pair at a time) |
| Quality | Good | Best |
| Use case | First-stage retrieval (top-100) | Reranking (top-100 to top-10) |
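The two rows compose into the standard retrieve-then-rerank pipeline: the cheap bi-encoder narrows the corpus, the expensive cross-encoder reorders the survivors. A toy sketch where both scorers are word-overlap stand-ins (not real models):

```python
# Stand-in for the dot product of precomputed embeddings (cheap).
def bi_encoder_score(query, doc):
    q, d = set(query.split()), set(doc.split())
    return len(q & d)

# Stand-in for a joint forward pass over (query, doc) (expensive).
def cross_encoder_score(query, doc):
    return bi_encoder_score(query, doc) + (query.split()[0] in doc)

def search(query, corpus, k_retrieve=3, k_final=2):
    # Stage 1: bi-encoder over the whole corpus (would be an ANN lookup).
    candidates = sorted(corpus, key=lambda d: bi_encoder_score(query, d),
                        reverse=True)[:k_retrieve]
    # Stage 2: cross-encoder only over the shortlist.
    return sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:k_final]

corpus = ["cats sit on mats", "dogs chase cats", "rain falls softly",
          "cats chase mice"]
print(search("cats chase", corpus))
```

The shape is the point: the cross-encoder runs k_retrieve forward passes, not one per corpus document, which is what makes "best quality" affordable.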
### BM25 vs Dense vs Hybrid (Ch. 58)

| | BM25 | Dense | Hybrid (RRF) |
|---|---|---|---|
| Signal | Exact term match | Semantic similarity | Both |
| Out-of-domain | Strong | Weak (needs fine-tuning) | Strong |
| Semantic understanding | None | Strong | Strong |
| Infrastructure | Inverted index | Vector DB + embeddings | Both |
| Quality (in-domain) | Baseline | Better | Best (+5-15 nDCG@10) |
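The fusion step behind the "Hybrid (RRF)" column is small enough to write in full: each ranker contributes 1 / (k + rank) per document, and documents are sorted by summed score. k = 60 is the conventional constant; the doc IDs below are made up:

```python
# Reciprocal Rank Fusion over any number of rankings.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d3", "d2"]    # lexical ranking
dense_ranking = ["d2", "d1", "d4"]   # semantic ranking
print(rrf([bm25_ranking, dense_ranking]))
```

Because RRF uses only ranks, it needs no score normalization between the incompatible BM25 and cosine-similarity scales, which is why it is the default hybrid combiner.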
## GPU & Compilation

### torch.compile vs ONNX Runtime vs TensorRT vs XLA (Ch. 130)

| | torch.compile | ONNX Runtime | TensorRT | XLA/JAX |
|---|---|---|---|---|
| Approach | Python-level tracing + Triton codegen | Graph-level optimization | Layer fusion + precision calibration | Whole-program HLO compilation |
| Dynamic shapes | Native | Native | Profiles (limited) | Recompile per shape |
| Setup cost | One line | Export to ONNX | Build engine (minutes) | JIT by default (`jax.jit`) |
| Hardware | NVIDIA + some CPU | CPU, CUDA, TensorRT, DirectML | NVIDIA only | TPU, GPU, CPU |
| Best for | Quick wins, custom ops | Cross-platform deploy | Max NVIDIA latency | Research, TPU workloads |
| LLM support | Yes (via vLLM/SGLang) | Limited for autoregressive | TRT-LLM (specialized) | Limited |
### GPU memory hierarchy (Ch. 129)

| | Registers | Shared / L1 | L2 Cache | HBM |
|---|---|---|---|---|
| Capacity (H100) | ~256 KB/SM | 228 KB/SM | 50 MB | 80 GB |
| Bandwidth | ~40 TB/s | ~19 TB/s | ~12 TB/s | 3.35 TB/s |
| Latency | ~0 cycles | ~20 cycles | ~200 cycles | ~400 cycles |
| Managed by | Compiler | Programmer | Hardware | Hardware |
| ML relevance | Accumulators, indices | FlashAttention tiles | Weight reuse | Model weights, KV cache |
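The HBM row sets a hard ceiling on decode speed: every generated token must stream all model weights from HBM. A back-of-envelope sketch using the table's 3.35 TB/s figure and a hypothetical 7B-parameter model in fp16:

```python
# Bandwidth-bound upper limit on single-stream decode throughput.
def max_decode_tok_per_s(params_b, dtype_bytes, hbm_tb_per_s):
    bytes_per_token = params_b * 1e9 * dtype_bytes   # all weights, once
    return hbm_tb_per_s * 1e12 / bytes_per_token

print(f"{max_decode_tok_per_s(7, 2, 3.35):.0f} tok/s ceiling")
```

Real systems land below this (attention, KV-cache reads, kernel overhead), and batching raises effective throughput by amortizing the same weight traffic over many sequences.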
## Infrastructure & DevOps

### Helm vs Kustomize vs CDK8s (Ch. 106)

| | Helm | Kustomize | CDK8s |
|---|---|---|---|
| Approach | Go templates | Overlay patching | Imperative code (TS/Py) |
| Readability | Low (template syntax) | High (plain YAML) | Medium (code) |
| Power | High | Medium | Highest |
| Ecosystem | Largest (charts) | Built into kubectl | Small |
| Best for | Third-party apps | Simple overlays | Complex generated manifests |
### Temporal vs Airflow vs Step Functions (Ch. 78)

| | Temporal | Airflow | Step Functions |
|---|---|---|---|
| Model | Durable execution | DAG scheduler | State machine (JSON) |
| Long-running | Yes (months) | No (task-level) | Yes (1 year max) |
| Language | Go/Java/Python/TS | Python | JSON/YAML |
| Self-hosted | Yes | Yes | No (AWS only) |
| Best for | Microservice orchestration | Data pipelines | AWS-native workflows |