Appendices
Appendix A

Glossary

Every term in this book collected in one place. Entries are alphabetical. Each is a short paragraph giving the definition needed to read the rest of the book, with a pointer to the chapter where the concept is developed in depth. If a term has multiple common meanings, the definition reflects how the book uses it.

Use this as a lookup, not a study guide. The chapters are where the reasoning lives; this is where you come when a term is blocking you on page 200.


A/B test — A controlled experiment where traffic is split between two variants and a metric is compared. In ML serving, used for comparing model versions, prompt templates, or retrieval configurations. Not the same as a canary (which is about safety of rollout, not statistical comparison). See Chapter 98.

Activation checkpointing — Also called gradient checkpointing. A training memory optimization that discards intermediate activations during the forward pass and recomputes them during the backward pass, trading roughly 30% more compute for dramatically lower activation memory. The standard technique for training larger models on fixed hardware. See Chapter 12.

Active parameters — For an MoE model, the number of parameters actually touched during a forward pass for a single token. A Mixtral 8x7B model has ~47B total parameters but ~13B active per token. Inference cost scales with active parameters, not total parameters; HBM footprint scales with total. See Chapter 34.

AdamW — The default optimizer for training LLMs. Adam with decoupled weight decay. Stores two optimizer state tensors per parameter (first and second moments), costing 8 bytes per parameter in fp32 — which is why a 70B model needs 560 GB of optimizer state even before gradients or activations. See Chapter 4.
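
The memory arithmetic is worth making concrete. A back-of-envelope sketch (`adamw_state_bytes` is an illustrative helper, not a library function):

```python
def adamw_state_bytes(n_params: int) -> int:
    # AdamW keeps two FP32 moments (m and v) per parameter: 2 * 4 bytes.
    return n_params * 2 * 4

# 70B parameters -> 560 GB of optimizer state alone,
# before weights, gradients, or activations.
state_gb = adamw_state_bytes(70 * 10**9) / 10**9
```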

Adapter — A small trainable module inserted into a frozen pretrained model. LoRA is the dominant adapter form today; older forms (bottleneck adapters, prefix tuning) are mostly historical. See Chapter 15.

Admission control — The gate that decides whether to accept a new request at all, as distinct from rate limiting (how fast requests flow) and load shedding (dropping already-accepted work). The cleanest defense against queue buildup under overload. See Chapter 77.

AI gateway — A reverse proxy specifically for LLM traffic. Speaks the OpenAI API shape, routes by the model field, handles rate limiting, key management, request transformation, and observability. Envoy AI Gateway, LiteLLM, Portkey, Kong AI Gateway are representative. See Chapter 50.

Alignment — The post-training stage that shapes a base model’s behavior toward human preferences and safety norms. Currently implemented via SFT plus preference learning (RLHF, DPO, KTO). Not a solved problem. See Chapter 17.

All-reduce — A collective operation where every participant contributes a tensor and ends up with the elementwise sum (or other reduction). The communication primitive behind data-parallel training. Ring all-reduce is the standard efficient implementation. See Chapter 12.

Ancestral sampling — Drawing tokens one at a time from the model’s predicted distribution. Temperature sampling, top-k, top-p are all ancestral. As opposed to beam search, which explores multiple futures. See Chapter 8.

ANN (approximate nearest neighbor) search — Finding vectors close to a query vector without scanning the whole corpus. HNSW and IVF are the two dominant algorithm families. See Chapter 59.

Arena-Hard — A benchmark built from real user queries scored by LLM judges. One of the more reliable open leaderboards precisely because the questions come from actual usage. See Chapter 20.

Arithmetic intensity — The ratio of floating-point operations to bytes of memory traffic for a kernel. A kernel with low arithmetic intensity is memory-bound; one with high intensity is compute-bound. The roofline model uses this directly. See Chapter 25.
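
As an illustration (a sketch; `gemm_arithmetic_intensity` is a made-up helper, and the traffic model assumes each matrix moves through memory exactly once), compare a decode-style GEMV with a prefill-style GEMM in BF16:

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_el: int = 2) -> float:
    # An (m x k) @ (k x n) matmul does 2*m*n*k FLOPs and, at minimum,
    # reads A and B and writes C once each.
    flops = 2 * m * n * k
    traffic = bytes_per_el * (m * k + k * n + m * n)
    return flops / traffic

decode = gemm_arithmetic_intensity(1, 4096, 4096)      # ~1 FLOP/byte: memory-bound
prefill = gemm_arithmetic_intensity(4096, 4096, 4096)  # >1000 FLOP/byte: compute-bound
```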

Attention — The mechanism where each token looks at every other token through a softmax over Q·K dot products, weighted by V. O(n²) in sequence length. The core operation of the transformer. See Chapter 6.
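
A minimal single-query sketch in plain Python (illustrative only, not the batched multi-head tensor form):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attend(q, keys, values):
    # Scores are scaled Q.K dot products; output is the score-weighted sum of V.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]
```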

Autograd — The system that records forward-pass operations on a computation graph and uses the chain rule to compute gradients during the backward pass. PyTorch’s autograd is dynamic (graph built each forward pass); TensorFlow 1’s was static. See Chapter 3.

AWQ (activation-aware weight quantization) — A post-training quantization scheme that preserves the most salient weight channels (those interacting with large activations) at higher precision. Dominant for INT4 weight quantization of open-weight LLMs. See Chapter 26.

Backpressure — Pushing resistance back up the chain when a downstream component is saturated, rather than letting work pile up or silently dropping it. The alternative to load shedding. Reactive Streams is the canonical programming model. See Chapter 77.

Backward pass — The gradient computation that follows a forward pass. Walks the computation graph in reverse, applying the chain rule at each node. See Chapter 3.

Batch — A group of examples processed together. For training, batching amortizes optimizer and gradient work. For inference, batching amortizes memory reads of the model weights across multiple tokens. See Chapters 4, 23.

Batch normalization — Normalizing activations across the batch dimension. Standard in older CNNs; almost never used in transformers because of the batch-size sensitivity. LayerNorm and RMSNorm replaced it. See Chapter 7.

Beam search — A decoding strategy that keeps the top-k partial sequences at each step, expanding each and keeping the best k. Standard in old seq2seq translation; dead for LLMs because it causes repetition and produces bland text. See Chapter 8.

BF16 (bfloat16) — 16-bit float with 8 exponent bits and 7 mantissa bits. Same exponent range as FP32, which is why it rarely needs loss scaling. The default precision for LLM training on modern hardware. See Chapter 13.

Bi-encoder — An embedding model where query and document are encoded independently and compared by a dot product. Cheap because you can precompute document embeddings. As opposed to a cross-encoder. See Chapter 9.

Block size — In PagedAttention, the number of tokens whose KV cache lives in a single physical block. vLLM defaults to 16. Small blocks reduce internal fragmentation but raise index overhead. See Chapter 24.

BM25 — A classical lexical retrieval function combining term frequency saturation and inverse document frequency. Still a hard baseline for dense retrievers to beat, especially on rare terms and exact matches. See Chapter 57.

BPE (byte pair encoding) — A subword tokenization algorithm that iteratively merges the most frequent byte or character pair. The basis for GPT, Llama, and most open-weight models’ tokenizers. See Chapter 5.
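
One merge step can be sketched in a few lines (a toy illustration of the merge rule, not a production tokenizer):

```python
from collections import Counter

def best_pair(words):
    # Count adjacent symbol pairs across the corpus; return the most frequent.
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    return pairs.most_common(1)[0][0]

def apply_merge(words, pair):
    # Replace every occurrence of the chosen pair with its concatenation.
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged
```

Training repeats these two steps until the vocabulary reaches its target size; the recorded merge list is then replayed at tokenization time.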

Broadcasting — Numpy/PyTorch’s rule for letting operations apply across tensors of compatible but non-identical shapes by virtually expanding the smaller tensor. Source of 80% of shape bugs. See Chapter 1.
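
For example (using NumPy; PyTorch follows the same rules):

```python
import numpy as np

a = np.ones((4, 3))            # shape (4, 3)
b = np.array([10., 20., 30.])  # shape (3,) -- virtually expanded to (4, 3)
c = a + b
assert c.shape == (4, 3)
assert c[0, 0] == 11.0

# The classic bug: a (4,) vector plus a (4, 1) column silently
# broadcasts to a (4, 4) matrix instead of raising an error.
bug = np.zeros(4) + np.zeros((4, 1))
assert bug.shape == (4, 4)
```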

Canary deployment — Rolling out a new version to a small fraction of traffic first, then gradually shifting if it looks healthy. The safety mechanism for continuous delivery. See Chapter 98.

Causal mask — The attention mask that prevents a token from attending to future tokens, making training of decoder-only LLMs tractable by computing all positions in parallel. An upper-triangular matrix of negative infinities added before softmax. See Chapter 6.
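
In code, the mask for a 3-token sequence looks like this (a plain-Python sketch of the pattern, applied to the score matrix before softmax):

```python
import math

def causal_mask(n):
    # 0 where attention is allowed (j <= i), -inf where the future is blocked.
    return [[0.0 if j <= i else -math.inf for j in range(n)]
            for i in range(n)]

# causal_mask(3):
# [[0, -inf, -inf],
#  [0,    0, -inf],
#  [0,    0,    0]]
```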

Cell architecture — A deployment topology where independent copies of the full stack are run per region, per tenant, or per shard, so that a failure is bounded to one cell. The main tool for blast-radius control. See Chapter 109.

Chain of thought (CoT) — A prompting or training technique where the model produces intermediate reasoning tokens before a final answer. The basis for the reasoning-model generation (o1, R1, Gemini Thinking). See Chapter 42.

Chinchilla scaling laws — Hoffmann et al.’s 2022 result that compute-optimal training uses roughly 20 tokens per parameter, correcting earlier (overly parameter-heavy) scaling laws. See Chapter 11.

Chunking — Splitting long documents into retrievable pieces. Strategies range from fixed-size to semantic to hierarchical parent-child. Underrated lever in RAG quality. See Chapter 61.

Classifier-free guidance — A diffusion-model technique for steering generation without a separate classifier. Out of scope for most LLM work but worth recognizing.

ColBERT — A late-interaction retrieval model: each query token and each document token get their own embedding, and the score is the sum of the best per-token matches. Higher quality than bi-encoders but much more memory. See Chapter 9.

Cold start — The latency cost of bringing a replica from zero to serving. For LLMs, dominated by model weight loading (tens of GB) rather than framework startup. Solved by pre-caching weights on node-local NVMe. See Chapter 52.

Collective operation — A communication primitive that involves all (or a subgroup of) participating processes: all-reduce, all-gather, reduce-scatter, broadcast, all-to-all. The vocabulary of distributed training. See Chapter 12.

Context length — The maximum number of tokens the model attends to in one forward pass. Limited by positional encoding and by KV cache memory in practice. Modern long-context models use RoPE with NTK or YaRN scaling. See Chapter 35.

Continuous batching — Token-level scheduling where new requests can join an in-flight batch as old ones finish, rather than waiting for the whole batch to complete. The Orca technique, now universal in production serving. See Chapter 23.

Continuous profiling — Always-on sampling profiler running in production, producing flame graphs on demand. Pyroscope and Parca are the open-source tools. eBPF-based. The fourth pillar of observability. See Chapter 96.

Contrastive learning — Training an embedder by pulling positive pairs together and pushing negative pairs apart in embedding space. The basis for modern dense retrieval. See Chapter 9.

Control plane — The set of components that manage configuration, orchestration, and policy, as distinct from the data plane that handles actual traffic. Kubernetes is a control plane. See Chapter 46.

Cross-encoder — A model where query and document are concatenated and fed through a transformer together, yielding a single relevance score. Cannot precompute document representations, so it’s only used as a reranker over a small candidate list. See Chapter 62.

Cross-entropy — The loss function used to train LLMs and classifiers. The negative log-likelihood of the target token under the model’s distribution, averaged over positions. Equivalent to maximum likelihood estimation. See Chapter 4.

CUDA — NVIDIA’s programming model for GPUs. A CUDA kernel is a function that runs on thousands of threads in parallel across SMs. Most ML engineers never write raw CUDA but understanding the execution model explains why FlashAttention, PagedAttention, and kernel fusion work. See Chapter 39.

CUDA graph — A captured sequence of CUDA operations that can be replayed with minimal host overhead. Critical for low-latency decode, where the per-token Python overhead would otherwise dominate. See Chapter 38.

CUDA stream — A queue of GPU operations that execute in order. Multiple streams allow overlapping compute and data transfer. PyTorch uses a default stream; non-default streams are used for pipeline overlap. See Chapter 39.

CUTLASS — NVIDIA’s C++ template library for writing high-performance GEMM and attention kernels. The building block most custom kernels compose; the layer between raw CUDA and frameworks like Triton. FlashAttention 3 is built on CUTLASS. See Chapters 38, 129.

Data parallelism — Copying the full model to each worker and splitting the batch across workers. The simplest form of distributed training. DDP in PyTorch. See Chapter 12.

DDP (distributed data parallel) — PyTorch’s data-parallel training implementation. Gradient all-reduce after every backward pass. See Chapter 12.

Decode — The autoregressive token-by-token generation phase of LLM inference, as opposed to prefill. Memory-bandwidth-bound, sequential within a request, low arithmetic intensity. See Chapter 21.

Dense retrieval — Retrieval by nearest-neighbor search in a learned embedding space. As opposed to sparse/lexical retrieval (BM25). See Chapter 58.

Dequantization — Converting quantized weights back to a higher-precision format before use. The standard way weight-only quantization works during matmul: load INT4, dequantize to BF16 in registers, multiply. See Chapter 26.

Disaggregated prefill/decode — Running the prefill and decode phases on different GPUs, with KV cache transferred between them over NVLink or RDMA. Wins big for VLM workloads and latency-sensitive short-context text; loses for pure throughput. See Chapter 36.

Distillation — Training a small student model to match the output distribution of a larger teacher model. Hard-label distillation is just training on teacher outputs; soft-label distillation uses the full teacher logits. See Chapter 18.

DPO (direct preference optimization) — A preference-learning method that trains directly on pairs of (preferred, dispreferred) responses using a cleverly constructed loss, without the explicit reward model and PPO loop of RLHF. Simpler and more stable than RLHF, now the default. See Chapter 17.

Dropout — Randomly zeroing a fraction of activations during training for regularization. Mostly unused in modern LLM pretraining (the data is too diverse to overfit to). See Chapter 7.

EAGLE — A self-speculative decoding scheme where the base model itself provides draft tokens via a lightweight auxiliary head. Higher acceptance rates than generic draft models. See Chapter 27.

Early exit — A strategy where a model can produce an answer after only some layers have run. Rarely used in practice for LLMs. See Chapter 42.

Embedding — A dense vector representation of an input. Learned by the model. For retrieval, the output of an embedder; for LLMs, the first layer lookup that turns a token ID into a vector. See Chapters 7, 9.

Encoder — A transformer block stack that produces embeddings, as opposed to a decoder that produces tokens. BERT is an encoder; GPT is a decoder. See Chapter 7.

Epoch — One full pass over the training data. Modern LLM pretraining uses less than one epoch on very large corpora, so the term has lost most of its meaning there. See Chapter 11.

Error budget — The allowed unreliability under an SLO, typically expressed as a number of minutes of downtime per month or a fraction of requests that may fail. Used to gate feature velocity against stability. See Chapter 97.

Expert parallelism — For MoE models, placing different experts on different GPUs and routing tokens via an all-to-all collective. The natural parallelism strategy for MoE inference. See Chapters 28, 34.

FAISS — Facebook’s vector search library. Supports IVF, HNSW, PQ, and many hybrids. The reference implementation of approximate nearest neighbor search. See Chapter 59.

Feature store — A system that stores precomputed ML features for online serving and historical training, bridging the offline-online consistency gap. Feast, Tecton. Less central to LLM systems than to classical ML. See Chapter 91.

Few-shot — Providing example input-output pairs in the prompt. A free capability of large pretrained models, documented in the GPT-3 paper. See Chapter 8.

FFN (feed-forward network) — The per-position MLP inside each transformer block. In a dense model, two linear layers with an activation. In an MoE model, a router plus a set of experts. See Chapter 7.

FlashAttention — An attention kernel that avoids materializing the full attention matrix in HBM by tiling the computation so the softmax happens in SRAM. 2-4× faster than a naive PyTorch attention implementation. See Chapter 25.

FP8 — 8-bit floating point. Two variants on Hopper and later: E4M3 (more mantissa, less range) for weights and activations; E5M2 (more range) for gradients. Halves memory and bandwidth vs BF16 for roughly 1-2% quality hit when done carefully. See Chapters 13, 26.

FSDP (fully sharded data parallel) — PyTorch’s implementation of ZeRO-3. Shards parameters, gradients, and optimizer state across workers, reconstituting them per layer as needed. Standard for training models that don’t fit in memory. See Chapter 12.

GEMM (general matrix multiply) — The matrix multiplication kernel that dominates compute in every neural network. “The only op that matters.” See Chapter 38.

GGUF — The llama.cpp model file format. A single packed file containing quantized weights, metadata, and tokenizer. The lingua franca of CPU/consumer inference. See Chapter 44.

Golden signals — Google SRE’s four top-level metrics: latency, traffic, errors, saturation. The minimal set for a service dashboard. See Chapter 92.

GPTQ — A weight-only post-training quantization method using second-order information to select which weights can tolerate quantization error. Early and still widely used. See Chapter 26.

GQA (grouped-query attention) — An attention head layout where multiple query heads share a single K/V head, cutting KV cache size without the quality hit of pure multi-query attention. Standard in modern LLMs (Llama 2, 3; Mistral; Qwen). See Chapter 33.

Gradient accumulation — Summing gradients across several forward-backward passes before an optimizer step, to simulate a larger batch than fits in memory. See Chapter 12.

Gradient checkpointing — See activation checkpointing.

GradNorm — A metric (the norm of the global gradient vector) used for diagnosing training stability and for gradient clipping. See Chapter 4.

Greedy decoding — Always picking the highest-probability next token. Deterministic in theory but not in practice due to non-associative floating-point addition in matmul. See Chapter 8.

Guardrails — Content moderation and safety checks applied before and after the model call. Implemented as a separate inference workload (rules-based plus classifier plus LLM judge). See Chapter 56.

HBM (high-bandwidth memory) — The stacked DRAM next to the GPU die. 80 GB on H100, 192 GB on MI300X. Its bandwidth (2-5 TB/s) is usually the binding constraint on decode throughput. See Appendix D.

Helm — The dominant templating/packaging tool for Kubernetes applications. YAML with Go templates. Ugly but entrenched. See Chapter 108.

HNSW (hierarchical navigable small world) — A graph-based ANN index. Fast and high-recall but high memory and hard to update incrementally. The default for many vector databases. See Chapter 59.

Hybrid search — Combining dense retrieval and BM25 and fusing the results. Reliably beats either alone. RRF is the standard fusion method. See Chapter 60.

HyDE (hypothetical document embeddings) — Generating a hypothetical answer with an LLM and using its embedding for retrieval instead of the query’s. Helps when the query and relevant documents are in different registers. See Chapter 63.

Idempotency key — A client-supplied identifier that lets the server recognize and safely ignore a duplicate request. The HTTP pattern for getting exactly-once semantics out of an at-least-once wire. See Chapter 78.

InfiniBand — The high-performance RDMA fabric used for GPU-to-GPU networking in training clusters. NDR (400 Gb/s) and XDR (800 Gb/s) are current. See Appendix D.

InstructGPT — The 2022 paper that introduced RLHF at scale to LLMs. The predecessor to ChatGPT. See Chapter 17.

Internal JWT — A JWT minted by the edge gateway after authenticating the client, then passed to downstream services as the identity token. The alternative to re-validating the original client token at every hop. See Chapter 75.

IVF (inverted file index) — A partition-based ANN index. Clusters the corpus, then at query time searches only the nearest clusters. Cheaper to build and update than HNSW but typically lower recall at the same speed. See Chapter 59.

JAX — Google’s array-programming library. The basis for TPU-native LLM work. Stronger on compilation (XLA) than PyTorch; weaker ecosystem.

JWT (JSON Web Token) — A signed, self-contained token encoding claims about a principal. The default bearer token format for modern APIs. See Chapter 74.

KEDA (Kubernetes Event-Driven Autoscaling) — The autoscaler that scales workloads on custom metrics, including Prometheus queries against GPU-specific metrics like vllm:num_requests_running. Standard for scaling LLM inference on Kubernetes. See Chapter 51.

KL divergence — A non-symmetric distance between two probability distributions. Used in PPO (RLHF) to keep the policy close to the reference model, and in knowledge distillation as the soft-label loss. See Appendix C.

KServe — The Kubernetes-native inference server orchestrator. Defines the InferenceService CRD that wraps a runtime with autoscaling, traffic split, and canary. See Chapter 47.

KTO (Kahneman-Tversky optimization) — A preference-learning method that works on single-utterance binary feedback (“good” / “bad”) rather than pairwise preferences. Simpler data collection than DPO. See Chapter 17.

KV cache — The per-layer per-head per-token K and V vectors stored during decode so they don’t have to be recomputed. The single most important optimization in LLM inference. See Chapter 22.

LayerNorm — Normalization across the feature dimension of a single token, as opposed to batch normalization across the batch dimension. Original transformer used it; modern LLMs use RMSNorm instead. See Chapter 7.

Leaky bucket — A rate-limiting algorithm where requests fill a bucket that leaks at a fixed rate. Smooths bursts into a steady stream. As opposed to token bucket which allows bursting. See Chapter 76.
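
A minimal sketch (an illustrative class, not a production limiter; real implementations also need locking and a clock source):

```python
class LeakyBucket:
    # `capacity` requests may queue; the bucket drains at `rate` requests/sec.
    def __init__(self, capacity: float, rate: float):
        self.capacity, self.rate = capacity, rate
        self.level, self.last = 0.0, 0.0

    def allow(self, now: float) -> bool:
        # Drain whatever leaked since the last call, then try to add this request.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False
```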

Linear attention — Attention mechanisms with linear rather than quadratic complexity in sequence length. Mostly haven’t beaten quadratic attention on quality for the same parameter count. See Chapter 41.

Little’s Law — L = λW. The average number of items in a system equals arrival rate times average time per item. The foundation of queueing theory and the back-of-envelope lens for every autoscaling question. See Chapter 77.
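
The whole law fits in one line, and that line answers most capacity questions (the helper name is illustrative):

```python
def avg_in_flight(arrival_rate_rps: float, avg_time_s: float) -> float:
    # Little's Law: L = lambda * W.
    return arrival_rate_rps * avg_time_s

# 100 req/s with 2.5 s average generation time -> ~250 requests in flight,
# which is the concurrency your replicas must be provisioned to absorb.
```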

LMCache — A KV cache storage system that tiers HBM, DDR, and NVMe, enabling cross-replica sharing via Redis lookups. See Chapters 37, 51.

Load shedding — Dropping requests when overloaded, to protect latency for the ones you accept. The opposite of queueing-till-death. See Chapter 77.

LoRA (low-rank adaptation) — A fine-tuning technique that learns a low-rank decomposition (A × B) of the weight delta, freezing the base weights. Reduces trainable parameters by 1000× while losing little quality. See Chapter 15.
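
The parameter arithmetic behind the claim, for a single square projection (a sketch; shapes and rank are illustrative):

```python
def full_params(d_in: int, d_out: int) -> int:
    # The frozen base weight matrix.
    return d_in * d_out

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # A is (d_in x r), B is (r x d_out); only these train.
    return d_in * r + r * d_out

full = full_params(4096, 4096)     # 16,777,216 frozen
lora = lora_params(4096, 4096, 8)  #     65,536 trainable: 256x fewer
```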

Loss scaling — Multiplying the loss by a large constant before the backward pass in FP16 training so that small gradients don’t underflow. Not needed for BF16 because BF16 has FP32’s exponent range. See Chapter 13.

LSM tree — The storage engine behind RocksDB, Cassandra, and most time-series databases. Not usually relevant to LLM serving itself but appears in the metering and telemetry stacks. See Chapter 87.

MCP (Model Context Protocol) — Anthropic’s 2024 standard for connecting LLMs to external tools, resources, and prompts via a client-server protocol. The emerging universal tool-calling surface. See Chapter 69.

Mean pooling — Averaging the token embeddings to produce a sentence embedding. The default pooling for most sentence-transformers models. See Chapter 9.

Medusa — A speculative decoding technique that adds several prediction heads to the base model and uses them to propose multiple tokens at once. See Chapter 27.

MIG (multi-instance GPU) — NVIDIA’s hardware partitioning of a GPU into multiple isolated compute slices. Useful for multi-tenant platforms that serve small models per tenant. See Chapter 120.

Mixed precision — Training with both FP32 and a lower-precision format (FP16 or BF16) selectively, using low precision where safe (matmuls) and high precision where needed (loss, optimizer state, normalization). See Chapter 13.

MLA (multi-head latent attention) — DeepSeek’s attention variant that compresses K and V into a low-rank latent before caching, drastically shrinking the KV cache. See Chapter 33.

Model card — A structured description of a model: intended use, training data, evaluation results, known limitations, license. The first thing to read about a new release. See Chapter 10.

Model parallelism — Umbrella term for tensor, pipeline, expert, and sequence parallelism. Splitting the model itself across workers, as opposed to replicating it. See Chapter 28.

MoE (mixture of experts) — A feed-forward layer variant that routes each token to a subset of “expert” FFNs rather than running the full FFN. Increases parameter count without increasing per-token compute. Mixtral, DeepSeek-V3, Qwen MoE. See Chapter 34.

mTLS (mutual TLS) — TLS where both client and server present certificates. The default inter-service authentication in a modern service mesh. See Chapter 81.

Multi-query attention (MQA) — An attention variant with a single K/V head shared across all query heads. The extreme version of GQA. Saves memory but costs quality; GQA with 4-8 KV heads is the modern compromise. See Chapter 33.

NCCL — NVIDIA’s collective communication library. The backend for all-reduce and other collectives on NVIDIA hardware. Ring, tree, and hybrid topologies. See Chapters 12, 129.

Nucleus sampling — See top-p.

NVLink — NVIDIA’s high-bandwidth GPU-to-GPU interconnect, much faster than PCIe. Fifth-generation NVLink on Blackwell does 1.8 TB/s per GPU. See Appendix D.

NVSwitch — The NVLink switch chip that connects multiple GPUs in a fully-connected NVLink fabric. The basis for 8-GPU (HGX) and 72-GPU (GB200 NVL72) systems. See Appendix D.

OCI (Open Container Initiative) — The standard for container image format and runtime. What “Docker image” really means today. See Chapter 102.

OIDC (OpenID Connect) — An identity layer on top of OAuth 2.0 that defines an ID token (a JWT) for authenticated sessions. The standard federation protocol. See Chapter 74.

ONNX (Open Neural Network Exchange) — A vendor-neutral interchange format for ML models. A directed graph of typed operators. Used for cross-platform deployment via ONNX Runtime. See Chapter 40.

ONNX Runtime (ORT) — Microsoft’s optimizing runtime for ONNX models. Supports CPU, CUDA, TensorRT, and DirectML backends. Graph-level optimizations + hardware-specific kernels. See Chapter 40.

Open-loop benchmark — A load test where request arrivals are independent of the system’s response time (Poisson arrivals, for instance). Closer to real traffic behavior than a closed-loop benchmark where a fixed number of clients wait for responses. See Chapter 55.

Optimizer state — The per-parameter memory maintained by the optimizer: for AdamW, two moments in FP32, for 8 bytes per parameter. Dominates training memory for large models. See Chapters 4, 12.

Outliers — The minority of activations or weights that have unusually large magnitude, breaking quantization schemes that assume a narrow distribution. The SmoothQuant and AWQ work is basically about outlier handling. See Chapter 26.

PagedAttention — The vLLM technique that stores KV cache in fixed-size physical blocks addressed by a block table, eliminating internal fragmentation and enabling copy-on-write for prefix sharing. See Chapter 24.

Parquet — A columnar storage format for analytical data. The file format behind most modern data lakes and lakehouses. See Chapter 90.

PEFT (parameter-efficient fine-tuning) — The family of fine-tuning methods that update a small subset of parameters. LoRA is the dominant member. See Chapter 15.

Perplexity — exp(cross-entropy loss). Historically the main LM evaluation metric, now mostly used for training diagnostics because it doesn’t predict downstream task performance very well. See Chapter 20.
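
The relationship is direct (a sketch; `token_nlls` holds per-token negative log-likelihoods, i.e. the cross-entropy at each position):

```python
import math

def perplexity(token_nlls):
    # exp of the mean cross-entropy over positions.
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns every target token probability 1/4 has
# perplexity 4: it is "as confused as" a uniform 4-way choice.
```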

Pipeline parallelism — Splitting the model’s layers across GPUs and running a stream of microbatches through them. Suffers idle “bubbles” unless the microbatch schedule fills the pipeline carefully. See Chapter 28.

Post-norm vs pre-norm — Whether the LayerNorm comes after or before the residual addition in a transformer block. Pre-norm (norm first) trains more stably at depth and is universal in modern LLMs. See Chapter 7.

PPO (proximal policy optimization) — The reinforcement learning algorithm used in classical RLHF. A clipped-ratio policy gradient method. Replaced in practice by DPO for most teams. See Chapter 17.

Prefill — The first forward pass of LLM inference that processes the entire prompt and populates the KV cache. Compute-bound and parallelizable across prompt tokens. See Chapter 21.

Prefix caching — Sharing the KV cache for a common prompt prefix across requests, so each request only computes attention for its unique suffix. vLLM block-level, SGLang radix-level. See Chapter 29.

Principal — The identity making a request: a user, a service, an agent. Carried through the system in a context object or a JWT. See Chapter 75.

Prometheus — The standard metrics system for cloud-native infrastructure. Pull-based, time-series, with PromQL as the query language. See Chapter 93.

Prompt caching — Used loosely to mean both KV-level prefix caching and API-level “we memoized the output for this prompt.” Context-dependent. See Chapter 29.

Prompt injection — An attack where untrusted content in the input (tool output, retrieved document) contains instructions the model follows. The hardest unsolved problem in agent safety. See Chapter 71.

PyTorch — The dominant deep learning framework. Imperative, Pythonic, good for research, now also good for production. See Chapter 2.

Q, K, V — Query, key, value. The three projections of the input that feed attention. Q comes from the current token, K and V come from all past tokens. See Chapter 6.

QLoRA — Fine-tuning using LoRA adapters over a 4-bit quantized base model. Makes fine-tuning a 70B model possible on a single 48 GB GPU. See Chapter 15.

Quantization — Representing weights or activations with fewer bits than the training precision. INT8, INT4, FP8 are the main targets. Trade-off between quality, speed, and memory. See Chapter 26.

Radix attention — SGLang’s token-level prefix cache. A radix tree indexes every KV cache prefix ever seen, so any shared prefix across requests or across turns is reused. See Chapter 29.

Rate limiting — Restricting how many requests a client may make in a time window. Token bucket, leaky bucket, GCRA, fixed window, sliding window. See Chapter 76.

RDMA — Remote direct memory access. Moving memory between machines without involving either CPU. The mechanism behind InfiniBand and RoCE. See Appendix D.

ReAct — “Reasoning and acting.” The agent pattern where the model alternates Thought and Action steps. The original agent loop. See Chapter 67.

Recall@k — The fraction of relevant documents that appear in the top k retrieval results. The core retrieval metric. See Chapter 64.
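
The metric in code (a sketch; inputs are document IDs):

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of the relevant set that appears in the top-k results.
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)
```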

RED method — Tom Wilkie’s three metrics for request-driven services: rate, errors, duration. A compact subset of the golden signals. See Chapter 92.

Reranker — A cross-encoder (or LLM) applied to the top-k results of a cheaper retriever, reordering them by relevance. The standard second stage of a retrieval pipeline. See Chapter 62.

Residual connection — Adding the input of a block to its output, so gradients flow around the block. The trick that made very deep networks trainable. See Chapter 7.

RLHF (reinforcement learning from human feedback) — Training a reward model from human preference data, then optimizing the policy via PPO to maximize it. The InstructGPT recipe. Now mostly replaced by DPO. See Chapter 17.

RMSNorm — A simpler alternative to LayerNorm that only rescales by root-mean-square (no shift, no mean subtraction). Nearly as good and cheaper. Standard in Llama and most modern LLMs. See Chapter 7.
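
The whole operation, in a plain-Python sketch (real implementations work on tensors and learn `gain`):

```python
import math

def rmsnorm(x, gain, eps=1e-6):
    # Rescale by root-mean-square; no mean subtraction, no bias term.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]
```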

RoCE (RDMA over converged Ethernet) — RDMA running on Ethernet instead of InfiniBand. Cheaper and more standard but requires lossless Ethernet configuration. See Appendix D.

RoPE (rotary position embedding) — The positional encoding that rotates Q and K by an angle proportional to position. Supports extension to longer contexts via scaling techniques like YaRN. Universal in modern LLMs. See Chapter 35.

RPS — Requests per second. The traffic unit.

RRF (reciprocal rank fusion) — A fusion method for combining multiple ranked lists by summing 1/(k+rank). Simple, parameter-light, and effective. See Chapter 60.
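The scoring rule fits in a few lines; a sketch assuming the conventional k = 60 (identifiers are illustrative):

```python
def rrf_fuse(ranked_lists, k=60):
    """Combine ranked lists of doc IDs by summing reciprocal ranks 1/(k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d2", "d3"]   # lexical ranking
dense = ["d3", "d1", "d4"]  # vector ranking
print(rrf_fuse([bm25, dense]))  # docs ranked highly in both lists rise to the top
```

Because only ranks matter, RRF needs no score normalization across retrievers, which is why it is the default fusion choice for hybrid search.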

Runtime — In serving, the component that actually runs the model (vLLM, TGI, TEI, TensorRT-LLM). As distinct from the orchestrator that manages deployments (KServe, BentoML, Ray Serve). See Chapter 45.

Sampling (decoding) — Drawing the next token from the model’s predicted distribution, with modifications: temperature, top-k, top-p, min-p, repetition penalty. See Chapter 8.

ScaNN — Google’s ANN library. Uses quantization plus a scoring trick called anisotropic quantization. High recall at low latency. See Chapter 59.

Self-attention — Attention where Q, K, V all come from the same sequence. The transformer’s core operation. See Chapter 6.

Sequence parallelism — Splitting a single long sequence across GPUs. Used for long-context training and inference where a single sequence exceeds one GPU’s memory. Ring attention is the classic technique. See Chapters 28, 35.

Service mesh — A layer of sidecar proxies (Envoy, Linkerd) that handle inter-service traffic, providing mTLS, traffic shaping, observability. See Chapter 81.

SFT (supervised fine-tuning) — Training a base model on labeled (prompt, response) pairs to teach it the assistant format. The step before alignment. See Chapter 16.

SGLang — A serving framework with RadixAttention for token-level prefix caching and a DSL for structured programs. Competes with vLLM. See Chapter 44.

Shard — A fragment of a parameter, dataset, or cache held by one worker. Central to FSDP and ZeRO. See Chapter 12.

Shared memory (SRAM) — On-chip, programmer-managed memory on each GPU SM. ~228 KB/SM on H100. ~20 cycle latency, ~19 TB/s bandwidth. The key to FlashAttention’s performance — tiles of Q/K/V fit here so the score matrix never touches HBM. See Chapter 39.

Sliding window attention — Attention where each token only attends to a fixed window of previous tokens. Mistral 7B used it. Trades context capacity for memory. See Chapter 35.

SLO (service level objective) — The internal target for reliability. Looser than the external-facing SLA. “99.9% of requests at p99 < 2s over any 28-day window.” See Chapter 97.

SM (streaming multiprocessor) — The basic compute unit on an NVIDIA GPU. Each SM has its own registers, shared memory, and scheduler. An H100 has 132 SMs. See Chapter 39.

SmoothQuant — A quantization technique that shifts the quantization difficulty from activations to weights by dividing activations by a per-channel scale and multiplying weights by the same scale. See Chapter 26.
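The underlying identity is easy to verify: dividing each activation channel by a scale while multiplying the matching weight row by the same scale leaves the matmul unchanged, while shrinking the activation outliers. A toy sketch with made-up numbers, assuming the usual alpha = 0.5 migration strength:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def smooth_scales(act_absmax, w_absmax, alpha=0.5):
    # Per-channel migration scale: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_absmax, w_absmax)]

# Toy activations (2 tokens x 2 channels); channel 0 has an outlier.
X = [[10.0, 0.1], [8.0, 0.2]]
W = [[0.5, 1.0], [2.0, 0.3]]  # (channels x out_features)

act_max = [max(abs(row[j]) for row in X) for j in range(2)]
w_max = [max(abs(w) for w in W[j]) for j in range(2)]
s = smooth_scales(act_max, w_max)

X_s = [[row[j] / s[j] for j in range(2)] for row in X]         # activations get easier
W_s = [[W[j][k] * s[j] for k in range(2)] for j in range(2)]   # weights absorb the scale

# Product is mathematically identical; only the dynamic ranges moved.
print(matmul(X, W))
print(matmul(X_s, W_s))
```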

Softmax — The function that turns a vector of logits into a probability distribution. The nonlinearity inside attention and the output projection. See Chapter 6.

Speculative decoding — Generating several candidate tokens with a cheap draft model and verifying them in parallel with the big model. The main way to cut decode latency without changing the big model. See Chapter 27.

SSE (server-sent events) — HTTP streaming where the server pushes a stream of text events to the client over one long-lived connection. The wire format for streaming LLM responses. See Chapter 79.

Structured generation — Constraining the model to emit only outputs that fit a grammar (JSON schema, regex, BNF). Implemented by masking disallowed tokens at each step via an FSM. See Chapter 43.

SVD (singular value decomposition) — The factorization M = UΣVᵀ. The math behind low-rank approximation and the justification for LoRA. See Appendix C.

SwiGLU — The gated activation used in modern transformer FFNs. Three linear layers per FFN instead of two, with the gate computed as swish(xW) ⊙ xV. Slightly better than GELU. See Chapter 7.

Tail latency — The high percentiles (p99, p99.9) of the latency distribution. What users actually feel; what averages hide. See Chapter 31.

TBT (time between tokens) — The decode-phase latency metric: how long between successive output tokens. Also called inter-token latency (ITL) or TPOT (time per output token). As distinct from TTFT (time to first token). See Chapter 31.

TEI (text-embeddings-inference) — HuggingFace’s runtime for embedding and reranker models. Different workload shape from vLLM; batches many short requests rather than fewer long ones. See Chapter 49.

Temperature — The sampling hyperparameter that divides logits before the softmax. Temperature 0 is greedy; higher temperature increases entropy. See Chapter 8.
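Concretely, a pure-Python sketch of temperature-scaled softmax (illustrative):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 1.0))   # moderately peaked
print(softmax_with_temperature(logits, 0.25))  # near-greedy: mass piles onto the argmax
print(softmax_with_temperature(logits, 4.0))   # near-uniform: entropy rises
```

(Implementations special-case temperature 0 as a greedy argmax rather than dividing by zero.)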

Tensor — A multi-dimensional array. The primary data type in PyTorch, JAX, TensorFlow. See Chapter 1.

Tensor parallelism — Splitting individual matmuls across GPUs by partitioning the matrices column-wise or row-wise. Requires high-bandwidth GPU interconnect (NVLink) because of the all-reduce at each layer. See Chapter 28.

TensorRT — NVIDIA’s inference optimizer. Parses a model graph, applies layer fusion, precision calibration, and kernel auto-tuning to produce a hardware-specific engine. Maximum inference speed on NVIDIA GPUs, but the engine is not portable across GPU types. See Chapter 40.

torch.compile — PyTorch’s graph-mode optimizer (PyTorch 2.0+). TorchDynamo captures the Python-level graph, AOTAutograd records the backward, TorchInductor generates Triton kernels. One line (torch.compile(model)) for free fusion and codegen. See Chapter 40.

TorchDynamo — The Python bytecode tracer inside torch.compile. Intercepts frame evaluation to capture the computation graph without requiring code changes. See Chapter 40.

TorchInductor — The backend compiler in torch.compile that takes the captured graph and generates optimized Triton GPU kernels or C++ CPU code. See Chapter 40.

Thundering herd — The problem where many requests simultaneously miss the same cached item and all recompute it. Mitigated by request coalescing or random TTLs. See Chapter 89.

Token — The unit of input and output for an LLM. A subword, not a word. “Tokens are bytes, not concepts.” See Chapter 5.

Token bucket — A rate-limiting algorithm where a bucket of tokens refills at a constant rate and requests consume tokens. Allows bursting up to bucket size. See Chapter 76.
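The refill-and-consume logic is small enough to sketch (parameter names are illustrative; production limiters run this per client key, often in Redis):

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/sec, bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
allowed = sum(bucket.allow() for _ in range(20))
print(allowed)  # roughly the capacity: a burst of ~10 passes, the rest are throttled
```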

Tool calling — The protocol by which a model emits a structured request to invoke a function. OpenAI’s function calling, Anthropic’s tool use, MCP. See Chapter 66.

Top-k sampling — Restricting sampling to the k highest-probability tokens. See Chapter 8.

Top-p (nucleus) sampling — Restricting sampling to the smallest set of tokens whose cumulative probability exceeds p. See Chapter 8.
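The filtering step, run before the actual sampling draw, can be sketched as (illustrative):

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    # Renormalize over the kept set, then sample from this reduced distribution.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

print(top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9))  # keeps tokens 0, 1, 2
```

Unlike top-k, the nucleus adapts its size: a peaked distribution keeps few tokens, a flat one keeps many.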

TPU — Google’s tensor processing unit. A different architecture from GPUs: compute is organized around large systolic matrix units rather than many independent SMs. Only usable inside Google Cloud. See Appendix D.

Tracing — Distributed request tracing via propagated span contexts. OpenTelemetry is the standard. See Chapter 95.

Triton (kernel DSL) — OpenAI’s Python DSL for writing GPU kernels. The way most non-NVIDIA-staff engineers write fused kernels today. See Chapter 38.

Triton Inference Server — NVIDIA’s general-purpose model server. Confusingly named; unrelated to the kernel DSL. See Chapter 45.

TTFT (time to first token) — The latency from request arrival to the first output token. Dominated by prefill plus queue wait. The critical metric for chat UX. See Chapter 31.

USE method — Brendan Gregg’s three metrics for resources: utilization, saturation, errors. Complementary to RED. See Chapter 92.

Vector database — A store that indexes embeddings for ANN search. Milvus, Qdrant, Weaviate, Pinecone, pgvector, Elasticsearch with dense vector fields. See Chapter 59.

ViT (vision transformer) — A transformer applied to image patches. The vision-encoder half of most VLMs. See Chapter 32.

vLLM — The reference open-source LLM serving framework. PagedAttention, continuous batching, prefix caching, tensor parallelism. The default choice for most production deployments. See Chapter 48.

VLM (vision-language model) — A transformer that consumes both image patches (via a vision encoder) and text tokens in the same stream. Qwen-VL, LLaVA, InternVL, GPT-4V. See Chapter 32.

W3C Trace Context — The standard headers (traceparent, tracestate) for propagating trace IDs across services. See Chapter 95.

Workflow — A deterministic state machine for orchestrating long-running or distributed work. As distinct from an agent, which is nondeterministic. Temporal, Cadence, Airflow, Step Functions. See Chapter 70.

YaRN — A RoPE extension scheme for stretching a model’s context length beyond its training context. See Chapter 35.

ZeRO — DeepSpeed’s set of optimizer-state, gradient, and parameter sharding techniques (stages 1, 2, 3) for memory-efficient data-parallel training. FSDP is PyTorch’s port. See Chapter 12.

Zero-shot — Asking the model to do a task without any examples in the prompt. The default mode of modern instruction-tuned LLMs. See Chapter 8.


Additional terms (cross-cutting)

ACID — Atomicity, consistency, isolation, durability. The set of database transaction guarantees that open table formats like Iceberg and Delta Lake bring to object-storage data. See Chapter 90.

AOF (append-only file) — Redis’s durability mechanism. Every write command is appended to a log so the state can be reconstructed on restart. See Chapter 89.

At-least-once delivery — A message delivery guarantee where messages may be delivered more than once but will never be lost. The default for Kafka consumers. Requires idempotent downstream consumers to avoid double-processing. See Chapters 76, 82.

At-most-once delivery — Messages may be lost but will never be duplicated. Rarely what you actually want. See Chapter 84.

AutoGPTQ — A packaged implementation of GPTQ quantization for Hugging Face models. See Chapter 26.

BFF (backend-for-frontend) — An API gateway pattern where each client type (web, mobile, internal) has its own tailored backend service. See Chapter 73.

Block table — In PagedAttention, the per-sequence data structure mapping logical positions to physical KV blocks. Analogous to a page table in an OS. See Chapter 24.

Bursting — In rate limiting, allowing a client to exceed their steady-state rate briefly before being throttled. Token bucket allows bursting up to the bucket size. See Chapter 76.

Calibration — In quantization, running a small sample of inference data through the unquantized model to measure activation statistics, then using those stats to set quantization scales. Essential for FP8 and activation quantization. See Chapter 26.

Cell — An independent copy of a full service stack, scoped to bound blast radius. Failure of one cell should not affect others. See Chapter 109.

Circuit breaker — A stability pattern that trips open when a downstream dependency fails repeatedly, short-circuiting subsequent calls for a cooldown period to give the dependency time to recover. See Chapter 77.
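A toy version of the state machine (closed, open, half-open) with illustrative names; real implementations add per-window failure rates, metrics, and thread safety:

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: opens after `threshold` consecutive failures,
    rejects calls during `cooldown`, then allows one half-open trial call."""
    def __init__(self, threshold=3, cooldown=10.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open")  # fail fast, skip the dependency
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip open
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

Failing fast is the point: the breaker converts slow, cascading timeouts into immediate errors while the dependency recovers.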

CLIP — OpenAI’s contrastive image-text model. The vision encoder in many VLMs (or its descendants like SigLIP). See Chapter 32.

CoT (chain of thought) — See chain of thought.

CRD (custom resource definition) — Kubernetes extension mechanism for defining new resource types. KServe’s InferenceService is a CRD. See Chapter 47.

CUDA — NVIDIA’s GPU programming model and runtime. See Chapter 38.

Daemon set — A Kubernetes workload type that runs one pod per node. Used for node-local services like weight caches and monitoring agents. See Chapter 52.

Data parallelism — See DDP.

Distillation — See the main entry.

Embedding model — See embedding.

Envoy — The high-performance proxy at the core of most service meshes and many API gateways. Written in C++. See Chapters 48, 79.

Eventual consistency — A consistency model where all replicas will converge to the same state given enough time, but may disagree briefly. S3 used to be eventually consistent; it’s now strongly consistent. See Chapter 85.

Exponential backoff — A retry schedule where the wait time doubles after each failure, with jitter. The default for any retry loop. See Chapter 78.
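The "full jitter" variant (sleep a uniform random time below an exponentially growing cap) is a common default; a sketch with illustrative parameters:

```python
import random

def backoff_schedule(base=0.5, cap=30.0, attempts=6, rng=random.Random(0)):
    """Full-jitter exponential backoff: wait uniformly in [0, min(cap, base * 2^n)]."""
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

for n, delay in enumerate(backoff_schedule()):
    print(f"attempt {n}: wait {delay:.2f}s")
```

The jitter matters as much as the doubling: it desynchronizes retrying clients so they don't hammer a recovering service in lockstep.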

Fan-out / fan-in — Architectural patterns where one request triggers many downstream calls (fan-out) then aggregates responses (fan-in). Common in search and recommendation systems. See Chapter 73.

Federation (Prometheus) — A pattern where a top-level Prometheus scrapes aggregates from several lower-level Prometheus instances. Used to scale beyond a single server. See Chapter 93.

Flame graph — A visualization of stack samples showing which code paths spend the most time. The signature output of a profiler. See Chapter 96.

FlashAttention-2 / 3 — Successive versions of FlashAttention. FA2 added better parallelism across sequences and heads; FA3 (Hopper-only) added asynchronous Tensor Core pipelining and FP8 support. See Chapter 25.

Forward pass — One invocation of the model on an input. The forward pass is what inference does; training is forward plus backward. See Chapter 2.

Four Ms — An SRE framing: mitigation, measurement, management, morale. Complementary to the golden signals. See Chapter 92.

FP16 — 16-bit floating point, 5 exponent bits and 10 mantissa bits. Narrow exponent range — gradients underflow without loss scaling. Mostly replaced by BF16 for training. See Chapter 13.

GELU — Gaussian error linear unit. A smooth approximation of ReLU used in older transformers. Replaced by SwiGLU in modern LLMs. See Chapter 7.

GitOps — A deployment philosophy where the desired state is in Git and a controller continuously reconciles the cluster to it. ArgoCD and Flux are the canonical tools. See Chapter 107.

Gradient accumulation — See the main entry.

gRPC — A high-performance RPC framework over HTTP/2 with Protocol Buffers as the wire format. Dominant in service-to-service communication. See Chapter 104.

GuardRails (pattern) — See guardrails.

Hallucination — When an LLM generates confident but false information. A fundamental limitation of the architecture, not a bug. Mitigated by RAG (grounding in retrieved context) and by constrained generation. See Chapters 63, 69.

Hopper — The GPU architecture of H100 and H200. Generation after Ampere, before Blackwell. Introduced FP8 and the Transformer Engine. See Appendix D.

Horizontal pod autoscaler (HPA) — Kubernetes built-in autoscaler. Scales pods by CPU or memory by default. Useless for GPU workloads without custom metrics — that’s why KEDA exists. See Chapter 51.

ICL (in-context learning) — The ability of a large pretrained model to learn from examples in the prompt without gradient updates. Few-shot prompting is the canonical case. See Chapter 8.

Ingress — In Kubernetes, the API object that maps external traffic to in-cluster services. Implemented by an ingress controller (nginx, Envoy, Traefik). See Chapter 112.

Instrumentation — Adding metric, log, and trace emit points to code. See Chapter 95.

Jacobian — The matrix of partial derivatives of a vector-valued function. The multivariate version of a derivative. Shows up in backprop; usually not materialized explicitly. See Appendix C.

JSONL — JSON lines. One JSON object per line. The default format for streaming structured data to disk. See Chapter 16.

Judge (LLM-as-judge) — Using one LLM to score the outputs of another on quality dimensions. The practical eval method for generative tasks. See Chapter 20.

K8s — Abbreviation for Kubernetes.

Knowledge distillation — See distillation.

Latent — A compressed internal representation produced by a model. In MLA (Chapter 33), K and V are compressed into a latent vector before caching.

LM head — The final linear projection in a language model that maps the last hidden state to vocabulary logits. Often tied to the embedding matrix. See Chapter 7.

Load balancer — The component that distributes requests across replicas. Layer 4 (TCP) or Layer 7 (HTTP). See Chapter 73.

LLM-as-judge — See judge.

LRU — Least recently used. The default cache eviction policy. See Chapter 89.
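A minimal version on top of OrderedDict (illustrative; production caches add TTLs, size accounting, and locking):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used key when over capacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key, default=None):
        if key not in self.data:
            return default
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the LRU entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")           # touch "a", so "b" becomes least recently used
cache.put("c", 3)        # evicts "b"
print(list(cache.data))  # ['a', 'c']
```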

Manifest (Kubernetes) — A YAML file declaring Kubernetes resources. The unit that GitOps pipelines manage. See Chapter 107.

Metrics — Numerical time-series data about system behavior. One of the three (or four) pillars of observability. See Chapter 93.

Model registry — A service that stores model artifacts, metadata, and lineage, typically as a step between training and serving. See Chapter 45.

Monorepo — A single version control repository containing many independently deployable services and libraries. The alternative is polyrepo. See Chapter 101.

MTEB (massive text embedding benchmark) — The standard leaderboard for text embedding models. See Chapter 58.

Namespace — In Linux, a kernel isolation primitive used by containers (PID, network, mount, UTS, IPC, user, cgroup). In Kubernetes, a logical grouping of resources. See Chapter 102.

NCCL — See the main entry.

Noisy neighbor — A tenant whose resource consumption degrades performance for others sharing the same infrastructure. The central concern of multi-tenant design. See Chapter 120.

OCI — See the main entry.

On-call — The rotation of engineers responsible for production incidents. The existence of on-call is the reason SREs care about SLOs. See Chapter 99.

Online learning — Updating a model on new data as it arrives, without a full retrain. Rare for LLMs; common for recommendation systems. See Chapter 119.

OpenAPI — The REST API specification language (formerly Swagger). The contract-first alternative to gRPC. See Chapter 104.

OpenTelemetry — The vendor-neutral standard for instrumentation of metrics, logs, and traces. See Chapter 95.

Operator (Kubernetes) — A controller that encodes domain knowledge about how to manage an application on Kubernetes. Strimzi is a Kafka operator; Prometheus Operator manages Prometheus. See Chapter 47.

ORR (operational readiness review) — The checklist a service must pass before going to production: capacity, observability, runbooks, backups, on-call, security, dependencies. See Chapter 100.

Outlines / XGrammar — Libraries that constrain LLM generation to match a grammar or schema via per-step token masking. See Chapter 43.

Pre-norm — See post-norm vs pre-norm.

Presigned URL — An S3-style URL with embedded signature giving temporary access to an object without exposing credentials. Standard for client upload/download flows. See Chapter 85.

Profile — A snapshot of where CPU or memory is being spent in a program. Continuous profiling runs sampling profilers in production. See Chapter 96.

Prompt caching — See the main entry.

Propagator — An OpenTelemetry component that serializes trace context to outgoing request headers and deserializes it on incoming requests. See Chapter 95.

QPS — Queries per second. Same as RPS. The basic throughput unit.

Reranker — See the main entry.

Resource (Kubernetes) — A typed object in the Kubernetes API (Pod, Service, Deployment, InferenceService). See Chapter 47.

Rollout — A deploy of a new version. Can be direct, canary, blue-green, rolling. See Chapter 98.

Runbook — A document describing how to respond to a specific alert or incident type. Should be short, actionable, and version-controlled. See Chapters 97, 98.

Runtime (serving) — See the main entry.

Sampling (observability) — Keeping only a fraction of logs or traces to control cost. Head-based sampling decides at ingest; tail-based sampling decides after a trace completes, letting you preferentially keep slow or errored traces. See Chapter 95.

Sharding — Splitting data (or computation) across multiple machines by key. See Chapter 86.

Sidecar — A helper container running alongside the main container in the same pod, sharing the network namespace. Standard for service mesh proxies and telemetry collectors. See Chapter 81.

Single-table design (DynamoDB) — A schema style where all entity types share one table, using composite keys for access patterns. See Chapter 86.

Span — A single unit of work in a distributed trace. Has a start time, duration, parent span, and attributes. See Chapter 95.

SSE — See server-sent events.

SSM (state-space model) — See state-space models (Mamba, S4).

SwiGLU — See the main entry.

Telemetry — Metrics, logs, and traces collectively. See Chapter 92.

Tensor Core — The NVIDIA GPU hardware unit that accelerates matrix multiplies. The reason H100 is so much faster than V100. See Chapter 38.

tiktoken — OpenAI’s BPE tokenizer library, used by GPT models. See Chapter 5.

Tokens-per-second (tok/s) — The throughput unit for LLM inference. Usually reported as per-GPU and as aggregate. See Chapter 30.

Topology (cluster) — The physical arrangement of nodes and interconnects. 8-GPU HGX, multi-node InfiniBand cluster, NVL72 rack. See Appendix D.

Tracing — See distributed tracing.

Traffic split — Routing a fraction of traffic to a different backend, typically for canary rollouts. See Chapter 98.

Transformer Engine — NVIDIA’s library for FP8 training on Hopper and later. Handles the calibration and scaling automatically. See Chapter 13.

Triton (inference server) — See the main entry.

Triton (kernel DSL) — See the main entry.

TTL (time to live) — A per-entry expiration timestamp in a cache. When it expires, the entry is evicted on access. See Chapter 89.

USE method — See the main entry.

VLM — See the main entry.

VRAM — Colloquial term for GPU memory; technically HBM on datacenter GPUs. See Appendix D.

Warmup — The initial phase of a serving process where caches are cold, CUDA graphs are not yet compiled, and latency is high. Addressed by running a synthetic workload before accepting traffic. See Chapter 54.

Watchdog — A process or timer that detects when another process has hung and takes action. Used for deadlock detection in training and for health checks in serving.

Webhook — An HTTP callback from one service to another. Used in Kubernetes admission controllers, GitOps reconciliation, and CI integrations.

YAML — The configuration format of Kubernetes, Helm, and most modern CI systems. Whitespace-sensitive. Love-hate. See Chapter 108.