Reading list
The sources behind this book, curated. Papers, books, blog posts, and repositories, grouped by topic. Each entry has a one-line annotation: what it is, why to read it, and whether it’s actually worth your time or just hyped. Chapter links point to where the reference is used.
Not exhaustive. The real reading list for this field is a full-time job; this is the subset a senior interview candidate should actually have read or skimmed. If an entry is marked “skim only,” read the abstract and the figures and move on. If it’s marked “read cold,” that means you should be able to explain it on a whiteboard.
ML foundations and the transformer architecture
Vaswani et al., “Attention Is All You Need,” 2017. The transformer paper. Read cold. Chapters 6, 7.
Goodfellow, Bengio, Courville, Deep Learning, MIT Press, 2016. The foundational textbook. Its chapters 2-6 cover the math background you’ll be asked about. Most of the book is dated for LLM work, but the first third is still the cleanest treatment of backprop and optimization. Chapters 2-4.
Bishop, Pattern Recognition and Machine Learning, Springer, 2006. The classical ML textbook. Read it if you’re shaky on probability; otherwise skim. Appendix C.
Andrej Karpathy, “The Unreasonable Effectiveness of Recurrent Neural Networks,” blog, 2015. Historical. Worth reading for the intuition on sequence models, not for the RNN specifics. Chapter 2.
Andrej Karpathy, “Let’s build GPT from scratch,” YouTube, 2023. The clearest walkthrough of building a small transformer in PyTorch. Read the code alongside the video. Chapters 6-8.
Jay Alammar, “The Illustrated Transformer,” blog, 2018. The visual explainer everyone links to. Worth the first read; not deep enough for the second. Chapter 6.
Lilian Weng, “Attention? Attention!” blog, 2018. A survey of attention variants. Still useful for the taxonomy. Chapters 6, 33.
Xiong et al., “On Layer Normalization in the Transformer Architecture,” 2020. Why pre-norm wins. Chapter 7.
Zhang and Sennrich, “Root Mean Square Layer Normalization,” 2019. The RMSNorm paper. Short and worth reading. Chapter 7.
Shazeer, “GLU Variants Improve Transformer,” 2020. The SwiGLU paper. Two pages. Read cold. Chapter 7.
Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding,” 2021. The RoPE paper. Read the position encoding section carefully; the rest is historical. Chapter 35.
Sennrich, Haddow, Birch, “Neural Machine Translation of Rare Words with Subword Units,” 2015. The paper that brought BPE into NLP as a subword tokenizer. Short and readable. Chapter 5.
Kudo, Richardson, “SentencePiece: A simple and language independent subword tokenizer,” 2018. The SentencePiece paper. Skim the abstract and figures. Chapter 14.
Pretraining, scaling laws, and training data
Kaplan et al., “Scaling Laws for Neural Language Models,” 2020. The first scaling laws paper. Historically important, but its compute-optimal conclusions were partially wrong: it undervalued data relative to parameters, which Chinchilla later corrected. Chapter 11.
Hoffmann et al., “Training Compute-Optimal Large Language Models” (Chinchilla), 2022. Read cold. The 20-tokens-per-parameter rule and its derivation; a back-of-the-envelope sketch closes this group. Chapter 11.
Brown et al., “Language Models are Few-Shot Learners” (GPT-3), 2020. Read the results sections, skim the rest. Chapter 8.
Touvron et al., “LLaMA: Open and Efficient Foundation Language Models,” 2023. The first truly open strong base model. The recipe section is useful; ignore the benchmarks. Chapters 11, 13.
Touvron et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models,” 2023. The longer paper; the RLHF section is the most useful part. Chapter 17.
Grattafiori et al., “The Llama 3 Herd of Models,” 2024. The 92-page everything-paper. Read the data, compute, and training sections; skim evaluations. Chapters 11, 12.
DeepSeek-AI, “DeepSeek-V2 Technical Report,” 2024. For MLA. Read the attention section carefully. Chapter 33.
DeepSeek-AI, “DeepSeek-V3 Technical Report,” 2024. Read it twice. The reference example of a top-tier MoE trained on a realistic budget. Chapters 12, 33, 34.
Penedo et al., “The RefinedWeb Dataset for Falcon LLM,” 2023. The data-quality paper. Read it for the deduplication and filtering techniques. Chapter 11.
Rae et al., “Scaling Language Models: Methods, Analysis & Insights from Training Gopher,” 2021. DeepMind’s scale paper before Chinchilla. Mostly of historical interest now. Chapter 11.
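The back-of-the-envelope sketch promised under the Chinchilla entry, assuming the roughly 20-tokens-per-parameter ratio from the paper and the common ~6ND approximation for dense-transformer training FLOPs; the 70B example is illustrative, not taken from the paper’s tables.

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token count: ~20 training tokens per parameter."""
    return n_params * tokens_per_param

def training_flops(n_params: float, n_tokens: float) -> float:
    """Common ~6*N*D estimate of dense-transformer training FLOPs."""
    return 6.0 * n_params * n_tokens

n = 70e9                  # an illustrative 70B-parameter dense model
d = chinchilla_tokens(n)  # ~1.4e12 tokens
print(f"tokens ~ {d:.2e}, training FLOPs ~ {training_flops(n, d):.2e}")
```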
Distributed training
Rajbhandari et al., “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models,” 2019. Read cold; the memory accounting is sketched at the end of this group. Chapter 12.
Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism,” 2019. The tensor parallelism paper. Read it for the column/row partitioning math. Chapter 28.
Narayanan et al., “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM,” 2021. The combined DP+TP+PP paper. Useful for the interplay. Chapter 12.
Huang et al., “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism,” 2018. The pipeline parallelism paper. Chapter 28.
Korthikanti et al., “Reducing Activation Recomputation in Large Transformer Models,” 2022. Selective activation checkpointing. Chapter 12.
Zhao et al., “PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel,” 2023. The FSDP paper by the PyTorch team. Read it if you care about the implementation details. Chapter 12.
Micikevicius et al., “Mixed Precision Training,” 2017. The FP16 + loss scaling paper. Foundational. Chapter 13.
Micikevicius et al., “FP8 Formats for Deep Learning,” 2022. The joint NVIDIA/Arm/Intel paper. The E4M3/E5M2 split and why. Chapter 13.
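The memory-accounting sketch promised under the ZeRO entry, following the paper’s mixed-precision-Adam bookkeeping of roughly 16 bytes of model state per parameter. The function name and the 7B-model, 64-GPU example are illustrative.

```python
def zero_model_state_gb(n_params: float, dp_degree: int, stage: int) -> float:
    """Per-GPU model-state memory in GB. Bytes per parameter: 2 (fp16 weights)
    + 2 (fp16 grads) + 12 (fp32 master weights + Adam moments). ZeRO-1 shards
    the optimizer states, ZeRO-2 also the gradients, ZeRO-3 also the weights."""
    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= dp_degree
    if stage >= 2:
        grads /= dp_degree
    if stage >= 3:
        weights /= dp_degree
    return n_params * (weights + grads + optim) / 1e9

# A 7B model on 64 data-parallel GPUs: ~112 GB replicated vs ~1.75 GB under ZeRO-3.
print(zero_model_state_gb(7e9, 64, stage=0), zero_model_state_gb(7e9, 64, stage=3))
```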
Fine-tuning and alignment
Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” 2021. Read cold; a minimal sketch of the low-rank update closes this group. Chapter 15.
Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs,” 2023. Read cold. Chapter 15.
Houlsby et al., “Parameter-Efficient Transfer Learning for NLP,” 2019. The original adapter paper. Mostly of historical interest; read LoRA instead. Chapter 15.
Ouyang et al., “Training language models to follow instructions with human feedback” (InstructGPT), 2022. Read cold. The RLHF recipe. Chapter 17.
Rafailov et al., “Direct Preference Optimization: Your Language Model is Secretly a Reward Model,” 2023. Read cold. The derivation of DPO is elegant. Chapter 17.
Ethayarajh et al., “KTO: Model Alignment as Prospect Theoretic Optimization,” 2024. The KTO paper. Skim. Chapter 17.
Bai et al., “Constitutional AI: Harmlessness from AI Feedback,” 2022. Anthropic’s CAI paper. Worth reading for the self-critique mechanism. Chapter 17.
Taori et al., “Stanford Alpaca: An Instruction-following LLaMA Model,” blog/repo, 2023. The self-instruct recipe that kicked off open instruction tuning. Skim the blog; don’t bother with the data. Chapter 16.
Wang et al., “Self-Instruct: Aligning Language Models with Self-Generated Instructions,” 2023. The paper behind Alpaca. Chapter 19.
Köpf et al., “OpenAssistant Conversations,” 2023. The OASST dataset paper. Good for understanding how chat datasets are actually built. Chapter 16.
Chung et al., “Scaling Instruction-Finetuned Language Models” (FLAN), 2022. The instruction tuning scaling paper. Chapter 16.
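The LoRA sketch promised above: a frozen base linear layer plus a trainable low-rank update scaled by alpha/r, as in the paper. The class name, initialization constants, and hyperparameter values are illustrative choices, not the paper’s reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))  # only ~2 * r * 4096 parameters are trainable
```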
Distillation, quantization, and compression
Hinton, Vinyals, Dean, “Distilling the Knowledge in a Neural Network,” 2015. The original distillation paper. Short. Read cold; the soft-target loss is sketched at the end of this group. Chapter 18.
Sanh et al., “DistilBERT,” 2019. The BERT distillation paper. Historical. Chapter 18.
Frantar and Alistarh, “SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot,” 2023. The one-shot pruning paper. Read it for the pruning math. Chapter 18.
Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers,” 2022. Read cold. Chapter 26.
Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,” 2023. Read cold. Chapter 26.
Xiao et al., “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models,” 2022. The outlier-redistribution trick. Chapter 26.
Dettmers et al., “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale,” 2022. The outlier-aware INT8 paper. The motivation section is the most useful part. Chapter 26.
NVIDIA, “NVIDIA Hopper Architecture Whitepaper,” 2022. For the FP8 story and the Transformer Engine. Skim. Appendix D.
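The soft-target loss promised under the Hinton et al. entry: KL divergence between temperature-softened teacher and student distributions, scaled by T^2 as the paper suggests. A minimal sketch; the temperature value is an illustrative default.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    """KL(teacher || student) over temperature-softened distributions, scaled by T^2."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

loss = distillation_loss(torch.randn(8, 32000), torch.randn(8, 32000))
```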
Inference internals and serving
Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (vLLM), 2023. Read cold. Chapter 24.
Yu et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models,” 2022. Read cold. Continuous batching. Chapter 23.
Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,” 2022. Read cold. Chapter 25.
Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,” 2023. The FA2 improvements. Skim. Chapter 25.
Shah et al., “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision,” 2024. The Hopper-specific version. Skim unless you write kernels. Chapter 25.
Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs,” 2024. RadixAttention. Read the caching section carefully. Chapter 29.
Leviathan, Kalman, Matias, “Fast Inference from Transformers via Speculative Decoding,” 2022. The original speculative decoding paper. Read cold; the accept/reject step is sketched at the end of this group. Chapter 27.
Cai et al., “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads,” 2024. Chapter 27.
Li et al., “EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty,” 2024. Drafting with a lightweight head on the target model’s own features. Chapter 27.
Zhong et al., “DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving,” 2024. Read cold. Chapter 36.
Patel et al., “Splitwise: Efficient Generative LLM Inference Using Phase Splitting,” 2024. Read alongside DistServe. Chapter 36.
Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints,” 2023. Read cold. Chapter 33.
Shazeer, “Fast Transformer Decoding: One Write-Head is All You Need,” 2019. The MQA paper. Short. Historical but worth reading. Chapter 33.
Pope et al., “Efficiently Scaling Transformer Inference,” 2022. Google’s inference scaling paper. Useful for the TP and PP math. Chapter 28.
Liu et al., “Ring Attention with Blockwise Transformers for Near-Infinite Context,” 2023. Ring attention. Chapter 35.
Chen et al., “Extending Context Window of Large Language Models via Positional Interpolation,” 2023. Positional interpolation (PI) for stretching the RoPE context window. Chapter 35.
Peng et al., “YaRN: Efficient Context Window Extension of Large Language Models,” 2023. Read for RoPE scaling. Chapter 35.
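The accept/reject sketch promised under the speculative decoding entry, over explicit toy probability vectors rather than real models: accept the drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(p - q, 0), which keeps the output distribution exactly equal to the target model’s.

```python
import numpy as np

def speculative_accept_step(p: np.ndarray, q: np.ndarray, draft_token: int,
                            rng: np.random.Generator) -> int:
    """p: target-model distribution over the vocab, q: draft-model distribution,
    draft_token: the token the draft model sampled from q."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token                     # accept the draft token
    residual = np.maximum(p - q, 0.0)          # reject: resample from the residual
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                  # toy target distribution
q = np.array([0.2, 0.5, 0.3])                  # toy draft distribution
token = speculative_accept_step(p, q, draft_token=1, rng=rng)
```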
State-space and hybrid architectures
Gu, Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces,” 2023. Read cold. Chapter 41.
Dao, Gu, “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality” (Mamba-2), 2024. Skim. Chapter 41.
Gu et al., “S4: Efficiently Modeling Long Sequences with Structured State Spaces,” 2021. The prequel to Mamba. Optional. Chapter 41.
MoE
Shazeer et al., “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” 2017. The original MoE-for-deep-learning paper. Read cold. Chapter 34.
Fedus et al., “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,” 2021. Top-1 routing and load balancing; a toy routing sketch closes this group. Chapter 34.
Jiang et al., “Mixtral of Experts,” 2024. The Mixtral technical report. Short and clear. Chapter 34.
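The routing sketch promised under the Switch Transformers entry: score each token against a router matrix, keep the top-k experts, and renormalize their gates. Names and shapes are illustrative; real systems add auxiliary load-balancing losses and expert capacity limits, which this omits.

```python
import torch
import torch.nn.functional as F

def topk_route(hidden: torch.Tensor, router_weight: torch.Tensor, k: int = 2):
    """hidden: [tokens, d_model]; router_weight: [n_experts, d_model].
    Returns per-token expert indices and renormalized gate weights."""
    logits = hidden @ router_weight.T               # [tokens, n_experts]
    gates = F.softmax(logits, dim=-1)
    top_gates, top_idx = gates.topk(k, dim=-1)      # keep the k largest gates per token
    top_gates = top_gates / top_gates.sum(dim=-1, keepdim=True)
    return top_idx, top_gates

# Toy usage: 4 tokens, 8 experts, top-2 routing as in Mixtral (k=1 gives Switch-style routing).
idx, w = topk_route(torch.randn(4, 16), torch.randn(8, 16), k=2)
```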
Reasoning and test-time compute
Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” 2022. Read cold. Chapter 42.
Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models,” 2022. Sample several chains and take the majority answer. Chapter 42.
Yao et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models,” 2023. Skim. Chapter 42.
OpenAI, “Learning to Reason with LLMs” (o1 blog post), 2024. The o1 announcement. Short, nontechnical, worth reading. Chapter 42.
DeepSeek-AI, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” 2025. Read cold. The first public reasoning-model training recipe. Chapter 42.
Retrieval and RAG
Robertson, Zaragoza, “The Probabilistic Relevance Framework: BM25 and Beyond,” Foundations and Trends, 2009. The canonical BM25 reference. Dense but authoritative. Chapter 57.
Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering” (DPR), 2020. The first widely-used dense retriever. Chapter 58.
Khattab, Zaharia, “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT,” 2020. Read for the late-interaction idea. Chapter 9.
Malkov, Yashunin, “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs” (HNSW), 2016. Read cold if you touch vector search. Chapter 59.
Jégou, Douze, Schmid, “Product Quantization for Nearest Neighbor Search,” 2011. The PQ paper. Still the best intro. Chapter 59.
Johnson, Douze, Jégou, “Billion-scale similarity search with GPUs” (FAISS), 2017. The FAISS paper. Chapter 59.
Guo et al., “Accelerating Large-Scale Inference with Anisotropic Vector Quantization” (ScaNN), 2020. The paper behind Google’s ScaNN library. Chapter 59.
Thakur et al., “BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models,” 2021. The BEIR benchmark paper. The motivation section is the most useful. Chapter 64.
Muennighoff et al., “MTEB: Massive Text Embedding Benchmark,” 2022. The standard embedding benchmark suite. Chapter 58.
Cormack, Clarke, Büttcher, “Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods,” SIGIR 2009. The RRF paper. Three pages. Read cold; the scoring rule is sketched at the end of this group. Chapter 60.
Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (RAG), 2020. The paper that named the pattern. Mostly of historical interest now. Chapter 65.
Gao et al., “Precise Zero-Shot Dense Retrieval without Relevance Labels” (HyDE), 2022. Embed a hypothetical generated answer instead of the raw query. Chapter 63.
Es et al., “Ragas: Automated Evaluation of Retrieval Augmented Generation,” 2023. Read if you build RAG eval pipelines. Chapter 64.
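The RRF scoring rule promised above fits in a few lines: each document’s fused score is the sum over rankers of 1/(k + rank), with k = 60 as in the paper. The document ids and rankings below are toy examples.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """rankings: one ranked list of document ids per retriever, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a lexical (BM25) ranking with a dense-retriever ranking.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["c", "a", "d"]])
```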
Agents and tool use
Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” 2022. Read cold. Chapter 67.
Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning,” 2023. Agents that critique their own failed attempts and retry. Chapter 67.
Wang et al., “Voyager: An Open-Ended Embodied Agent with Large Language Models,” 2023. For the skill-library idea. Chapter 68.
Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools,” 2023. The self-teaching tool-use paper. Chapter 66.
Anthropic, “Introducing the Model Context Protocol,” 2024. The MCP announcement. Read alongside the spec. Chapter 69.
Anthropic, “Model Context Protocol specification,” docs, 2024–. Read the transport and primitives sections. Chapter 69.
Wu et al., “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation,” 2023. The multi-agent framework paper. Skim. Chapter 68.
LangChain documentation. Useful as a library reference, not as a design document. Read the code, not the marketing. Chapter 67.
Simon Willison’s blog, simonwillison.net. Not a reference but the best running commentary on LLM tools and prompt injection. Read whenever a new agent disaster happens. Chapter 71.
Distributed systems, reliability, infrastructure
Kleppmann, Designing Data-Intensive Applications, O’Reilly, 2017. Read cold. The single most important book on this list for systems interviews. Every chapter in Part VI assumes you know it. Chapters 71-89.
Tanenbaum, van Steen, Distributed Systems, 3rd ed., 2017. The textbook. Use as a reference for specific topics; don’t try to read cover to cover. Chapter 73.
Beyer, Jones, Petoff, Murphy (eds.), Site Reliability Engineering, O’Reilly, 2016. The Google SRE book. Read chapters 1-6, 10, 13. Chapter 97.
Beyer et al., The Site Reliability Workbook, O’Reilly, 2018. The practical companion. Read the SLO chapter cold. Chapter 97.
Brendan Gregg, Systems Performance, 2nd ed., Addison-Wesley, 2020. The reference for performance work. The USE method and the tools chapters are the most useful. Chapter 92.
Nygard, Release It!, 2nd ed., Pragmatic Bookshelf, 2018. Stability patterns (circuit breaker, bulkhead, back-pressure). Read cold. Chapter 77.
Fowler, Patterns of Enterprise Application Architecture, Addison-Wesley, 2002. The patterns catalog. Some terms on your interviewer’s vocabulary list come from here.
Newman, Building Microservices, 2nd ed., O’Reilly, 2021. Decent, not mandatory. Skim. Chapter 73.
Burns, Beda, Hightower, Kubernetes Up & Running, 3rd ed., O’Reilly, 2022. The Kubernetes intro. Skip if you already use K8s daily. Chapter 102.
Morris, Infrastructure as Code, 2nd ed., O’Reilly, 2020. The IaC book. Skim the first few chapters. Chapter 110.
Kim, Humble, Debois, Willis, The DevOps Handbook, 2nd ed., IT Revolution, 2021. Cultural, not technical. Optional.
Weaveworks, “GitOps” blog posts and documentation. The writing that defined the term. Read the original short version. Chapter 107.
Observability
Majors, Fong-Jones, Miranda, Observability Engineering, O’Reilly, 2022. The honest book about observability. Read it cold. Chapter 92.
Sridharan, Distributed Systems Observability, O’Reilly free report, 2018. The short early version. Free. Read it first. Chapter 92.
Tom Wilkie, “The RED Method: Key Metrics for Microservices Architecture,” Weaveworks blog, 2018. Chapter 92.
Prometheus documentation, prometheus.io. Read the PromQL section carefully; the rest is reference. Chapter 93.
OpenTelemetry documentation, opentelemetry.io. Reference. The tracing and context propagation sections are the useful parts. Chapter 95.
Build, deploy, operate
Rice, Container Security, O’Reilly, 2020. Read the chapters on cgroups, namespaces, and rootless containers. Chapter 102.
Hightower, “Kubernetes the Hard Way,” GitHub tutorial. Walk through it once. You’ll never forget what’s inside a control plane. Chapter 102.
Brazil, Prometheus: Up & Running, 2nd ed., O’Reilly, 2022. The Prometheus reference. Skim unless you operate Prometheus. Chapter 93.
Argo CD documentation, argo-cd.readthedocs.io. Read the App-of-Apps section. Chapter 107.
Helm documentation, helm.sh. Reference. Chapter 108.
Bazel documentation, bazel.build. Reference. The concepts (hermeticity, remote cache) are more important than the details. Chapter 101.
GPU hardware and networking
NVIDIA, “NVIDIA H100 Tensor Core GPU Architecture” whitepaper, 2022. Read the HBM and Transformer Engine sections. Appendix D.
NVIDIA, “NVIDIA Blackwell Architecture Technical Brief,” 2024. The B200 and GB200 details. Appendix D.
NVIDIA, NVLink and NVSwitch documentation. Reference. Appendix D.
InfiniBand Trade Association, “InfiniBand Architecture Specification.” Reference only. Don’t try to read it end to end. Appendix D.
ML system design and interview prep
Chip Huyen, Designing Machine Learning Systems, O’Reilly, 2022. Broader than this book in some places, shallower in others. Skim.
Alex Xu, Machine Learning System Design Interview, self-published, 2023. The only ML-interview-specific book. Shallow but covers the vocabulary. Use as a flashcard deck. Part X.
Alex Xu, System Design Interview, 2020. The classic SDI book. Not ML-specific. Worth owning. Part X.
Donne Martin, “system-design-primer” GitHub repo. Comprehensive but dated. Skim for topic breadth. Part X.
Chip Huyen’s blog, huyenchip.com. Consistently good on ML systems practice. Chapter 42.
Sebastian Raschka’s blog, sebastianraschka.com. Solid explainers on training and quantization topics. Chapters 15, 26.
Lilian Weng’s blog, lilianweng.github.io. Long-form surveys that are often the best intro to a subfield. Chapters 17, 40, 65.
Repositories worth reading
vllm-project/vllm. Read the vllm/core/scheduler.py and vllm/worker/worker.py files at least once. Chapter 48.
sgl-project/sglang. Read the RadixAttention implementation. Chapter 29.
huggingface/text-embeddings-inference. Read the batching logic. Chapter 49.
huggingface/text-generation-inference (TGI). The early-generation server, still useful as a reference point. Chapter 44.
NVIDIA/TensorRT-LLM. Read the README and example configs; much of the performance-critical code sits in closed NVIDIA libraries and is hard to follow. Chapter 44.
ggerganov/llama.cpp. Read the README and the quantization formats doc. Chapter 44.
pytorch/pytorch. Read torch/distributed/fsdp once to understand how sharding actually works. Chapter 12.
Dao-AILab/flash-attention. Read the CUTLASS templates at least to the point of knowing what they do. Chapter 25.
openai/triton. Read the Python DSL tutorials. Chapter 38.
microsoft/DeepSpeed. The reference implementation of ZeRO. Chapter 12.
facebookresearch/faiss. The reference vector index library. Chapter 59.
elastic/elasticsearch. The reference hybrid search engine. Read the BM25 implementation. Chapters 55, 58.
kserve/kserve. Read the InferenceService CRD and the autoscaler hooks. Chapter 47.
kedacore/keda. Read the Prometheus scaler. Chapter 51.
envoyproxy/envoy. Reference only. Enormous. Chapter 73.
open-telemetry/opentelemetry-specification. The spec. Reference. Chapter 95.
prometheus/prometheus. Read the TSDB design docs. Chapter 93.
temporalio/temporal. Read the “workflow vs activity” docs. Chapter 80.
Things you can skip (hyped but not worth cold reading)
Most LangChain and LlamaIndex documentation. Both useful as libraries; neither is a design document.
Most “awesome-*” lists on GitHub. Too shallow to be worth your time except for navigation.
Most prompt engineering “guides.” They age badly. Read the actual LLM papers instead.
Medium blog posts claiming to explain transformers. 90% are wrong. Read Karpathy or the original paper.
Any “LLM from scratch in 100 lines” article. Usually teaches you nothing that Karpathy doesn’t teach better.
Marketing whitepapers from vector DB vendors. Read the open papers (HNSW, PQ, ScaNN) instead.
“State of AI” reports. Entertainment, not education.
Deeper systems reading
Tanenbaum, Bos, Modern Operating Systems, 4th ed., Pearson, 2014. For the namespaces, scheduler, and virtual memory sections. Chapter 102.
Love, Linux Kernel Development, 3rd ed., Addison-Wesley, 2010. Dated but still the clearest intro to Linux internals. Chapter 102.
Bryant, O’Hallaron, Computer Systems: A Programmer’s Perspective, 3rd ed., Pearson, 2015. The book that teaches you what your compiler is actually doing. Worth reading once.
Henderson, Scaling Social Science, blog, and Will Larson’s writing. For the organizational-scaling discussions that occasionally show up in staff-level interviews. Optional.
Hohpe, Woolf, Enterprise Integration Patterns, Addison-Wesley, 2003. The messaging-patterns bible. Skim the first half for vocabulary. Chapter 84.
Abadi, “Consistency Tradeoffs in Modern Distributed Database System Design,” IEEE Computer, 2012. The PACELC framing: CAP tradeoffs apply only during partitions, while the latency-versus-consistency tradeoff applies all the time. Read for the mental model. Chapter 86.
Brewer, “CAP Twelve Years Later: How the Rules Have Changed,” IEEE Computer, 2012. The follow-up from the guy who named CAP. Short. Read cold.
Vogels, “Eventually Consistent,” CACM, 2009. Amazon’s seminal eventual-consistency piece. Read cold.
Additional ML systems blog sources
Horace He, “Making Deep Learning Go Brrrr From First Principles,” blog, 2022. The clearest piece on reasoning about compute-bound versus memory-bound workloads in deep learning. Read cold; a toy roofline check closes this group. Chapters 21, 25.
Stas Bekman, “Machine Learning Engineering” open book (GitHub stas00/ml-engineering). A running collection of training and inference notes, especially strong on distributed training gotchas. Chapters 12, 13.
HuggingFace blog, huggingface.co/blog. Hit-or-miss. When a post is by Tim Dettmers, Philipp Schmid, or Quentin Lhoest, read it. Otherwise skim.
vLLM blog, blog.vllm.ai. Short-form posts on what’s new in vLLM. Read the ones about prefix caching and disaggregated PD. Chapters 29, 36.
EleutherAI blog. Strong on training techniques and scaling experiments. Chapters 11, 12.
Google Research blog. Occasionally posts high-quality ML systems content (Pathways, Gemini architecture hints, scaling work).
Meta AI blog. Ditto for Llama, FSDP, and PyTorch-specific work.
Anthropic research. Short but dense posts on alignment and mechanistic interpretability. Read the ones on “features” and “circuits” for mechanistic depth; skip the rest unless you’re specifically interested in alignment research. Chapter 17.
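The roofline check promised under the Horace He entry: at batch size 1, decoding reads every weight once per token, so arithmetic intensity sits near 1 FLOP/byte, far below any modern accelerator’s ridge point. The hardware numbers below are illustrative assumptions, not a specific GPU’s spec.

```python
params = 70e9                                 # weights in an illustrative 70B dense model
bytes_per_param = 2                           # fp16/bf16 weights
flops_per_token = 2 * params                  # ~2 FLOPs per weight per generated token
bytes_per_token = bytes_per_param * params    # every weight streamed from HBM once per token

intensity = flops_per_token / bytes_per_token   # ~1 FLOP/byte

peak_flops = 1.0e15                           # assumed accelerator peak, FLOP/s
peak_bandwidth = 3.0e12                       # assumed HBM bandwidth, bytes/s
ridge = peak_flops / peak_bandwidth           # ~333 FLOP/byte

print(f"intensity ~ {intensity:.0f} FLOP/byte vs ridge ~ {ridge:.0f}: decode is memory bound")
```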
More MoE variants
Zhou et al., “Mixture-of-Experts with Expert Choice Routing,” 2022. An alternative to the standard top-k routing that avoids dropped tokens. Worth knowing about. Chapter 34.
Dai et al., “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models,” 2024. Fine-grained experts plus shared experts. Read for the DeepSeek-V3 background. Chapter 34.
Rajbhandari et al., “DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale,” 2022. The MoE systems paper from DeepSpeed. Chapter 34.
Benchmarks and evaluation
Hendrycks et al., “Measuring Massive Multitask Language Understanding” (MMLU), 2020. The original MMLU paper. Chapter 20.
Chen et al., “Evaluating Large Language Models Trained on Code” (HumanEval), 2021. The Codex paper with the HumanEval benchmark; the pass@k estimator is sketched at the end of this group. Chapter 20.
Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” 2023. The LLM-as-judge methodology paper. Read cold if you build evals. Chapter 20.
Dubois et al., “AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback,” 2023. Chapter 20.
Liang et al., “Holistic Evaluation of Language Models” (HELM), 2022. The Stanford CRFM evaluation framework. Read for the breadth of metrics; most people never use HELM directly. Chapter 20.
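The pass@k estimator promised under the HumanEval entry, as given in the Codex paper: generate n samples per problem, count c correct, and estimate the probability that at least one of k randomly chosen samples passes. The example counts are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=37, k=10))   # made-up counts for one problem
```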
Retrieval deeper cuts
Luan et al., “Sparse, Dense, and Attentional Representations for Text Retrieval,” 2020. The theoretical comparison paper. Worth skimming for the bounds. Chapter 58.
Santhanam et al., “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction,” 2021. Chapter 9.
Nogueira, Cho, “Passage Re-ranking with BERT,” 2019. The first widely-used cross-encoder reranker paper. Chapter 62.
Xiong et al., “Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval” (ANCE), 2020. The hard negative mining paper. Chapter 58.
Izacard, Grave, “Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering” (Fusion-in-Decoder), 2020. An early generation-with-retrieval paper. Chapter 65.
Agents and RL background
Sutton, Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018. The RL textbook. Read chapters 1-5 if you want PPO to make sense. Skim otherwise. Chapter 17.
Schulman et al., “Proximal Policy Optimization Algorithms” (PPO), 2017. The PPO paper. Read if you want to understand the RLHF machinery; skip if you’re only going to use DPO. Chapter 17.
Christiano et al., “Deep Reinforcement Learning from Human Preferences,” 2017. The paper that first trained a reward model from human preferences for RL. The conceptual ancestor of RLHF. Chapter 17.
Kernel and systems deeper cuts
Chen et al., “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning,” 2018. The TVM paper. Read if you care about compilation stacks; skim otherwise. Chapter 38.
Tillet, Kung, Cox, “Triton: an intermediate language and compiler for tiled neural network computations,” 2019. The original Triton paper. Chapter 38.
NVIDIA, “CUDA C++ Programming Guide.” Reference. Read the sections on shared memory, warps, and memory coalescing. Chapter 38.
Observability deeper cuts
Turnbull, The Logstash Book, self-published, 2013. Historical. Skip unless you’re specifically working with ELK.
Fong-Jones and Sridharan, various talks. The best practitioners’ conference talks on observability in production. Search their names on YouTube. Chapter 92.
Abraham et al., “Scuba: Diving into Data at Facebook,” VLDB 2013. The Scuba paper. Read for the interactive-observability model, a different paradigm from time-series.
If you read the “read cold” entries above and work through Kleppmann once, you have out-read more than 90% of the candidates you will be interviewing against. The rest is practice: pick a topic in the glossary, find the paper or chapter here, read it, and try to explain it cold.