Night-before cheat sheet
Part III: Inference Internals & Production Serving
36 chapters (21–56). 72 key facts to review.
Key facts (from quizzes)
Ch 21: Prefill vs decode: the two-phase nature of LLM inference
- Q: Why is decode described as memory-bandwidth-bound while prefill is described as compute-bound?
  A: During decode each step processes a single token, requiring a full weight read but very little arithmetic, while prefill processes many tokens simultaneously, achieving high arithmetic intensity.
  Why: Arithmetic intensity (FLOPs per byte read) is proportional to the number of tokens processed in parallel. Single-token decode has AI near 1 (well below the GPU compute-memory crossover), while multi-token prefill achieves much higher AI.
- Q: What metrics correspond to prefill latency and decode latency respectively in a production serving context?
  A: Time-to-first-token (TTFT) and time-per-output-token (TPOT).
  Why: TTFT measures how long the user waits before seeing any response; it is dominated by the prefill phase. TPOT measures per-token generation speed; it is dominated by decode latency and directly sets the streaming speed the user perceives.
- Q: A customer-facing chat application has prompts averaging 200 tokens and completions averaging 500 tokens. Is this workload prefill-bound or decode-bound, and what optimization should be prioritized?
  A: Decode-bound, because the long completion requires many sequential decode steps; prioritize KV cache memory efficiency and batching to improve decode throughput.
  Why: Each of the 500 output tokens requires a separate decode step. With a 200-token prefill (one pass) versus 500 decode steps, the workload is clearly decode-heavy. Maximizing batch size and reducing per-token memory access time has the most impact.
- Q: Prefill-decode disaggregation splits the two phases onto separate machines. What is the key trade-off this introduces?
  A: Disaggregation improves utilization of both phases independently but requires transferring large KV cache tensors between prefill and decode nodes over the network.
  Why: Separate prefill and decode fleets can each be optimized independently (e.g., beefy prefill GPUs, many decode GPUs). The cost is that after prefill completes, the KV cache must be transferred to the decode node, adding network latency and bandwidth pressure.
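The compute-vs-bandwidth crossover above can be made concrete with a toy arithmetic-intensity calculation. The 2-FLOPs-per-weight (multiply + add) and 2-bytes-per-weight (BF16) accounting is the usual back-of-envelope convention, not a figure from the chapter:

```python
def arithmetic_intensity(batch_tokens: int, bytes_per_weight: float = 2.0) -> float:
    """AI = FLOPs per byte read from HBM for one sweep over the weights.

    Each weight is read once (bytes_per_weight) and participates in
    2 FLOPs (multiply + add) per token processed in parallel.
    """
    flops_per_weight = 2.0 * batch_tokens
    return flops_per_weight / bytes_per_weight

decode_ai = arithmetic_intensity(batch_tokens=1)      # single-token decode step
prefill_ai = arithmetic_intensity(batch_tokens=2048)  # 2048-token prefill pass

print(decode_ai)   # 1.0 FLOP/byte: far below a modern GPU's crossover point
print(prefill_ai)  # 2048.0 FLOP/byte: comfortably compute-bound
```

Same weights, same bytes read; only the number of tokens sharing that read changes, which is exactly why prefill and decode land on opposite sides of the roofline.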
Ch 22: The KV cache: the single most important optimization in LLM inference
- Q: Without a KV cache, how does the compute cost scale with sequence length during autoregressive generation of T output tokens after a prompt of length S?
  A: Quadratically per step, repeated T times — step t re-runs O((S+t)^2) attention over the whole sequence, roughly O(T x S^2) total attention work for T << S.
  Why: Without caching, every decode step re-runs attention over the full growing sequence instead of computing only the new token's row. With a cache, step t costs O(S+t) attention work; without it, the same step costs O((S+t)^2), and the waste compounds over all T steps.
- Q: What does the KV cache store, and why does it store only K and V and not Q?
  A: It stores the K and V projections from previous positions, because future attention queries need to attend to past keys and values but past queries are never needed again.
  Why: The attention equation is softmax(Q K^T) V. At each new decode step the query comes from the new token and attends to all previous keys and values. The old queries are never referenced again, so caching them provides no benefit.
- Q: How do you calculate the KV cache memory per token for a transformer with L layers, H heads, head dimension D, and precision P bytes?
  A: L x 2 x H x D x P bytes per token.
  Why: Each layer stores one K and one V vector per head per token, each of dimension D. That is 2 x H x D values per layer per token; multiplying by L layers and P bytes per value gives the total.
- Q: At a batch of 128 concurrent requests each generating up to 2048 tokens, a 70B model's KV cache exceeds GPU HBM capacity. Which architectural decision made this problem worse compared to an older 13B model?
  A: 70B models have more layers and a larger hidden dimension, both of which multiply directly into the per-token KV cache formula.
  Why: KV cache size scales as L x 2 x H x D x P per token. A 70B model has more layers L and a larger hidden dimension (H x D) than a 13B model. Both factors multiply directly, making the total cache size roughly proportional to the parameter count.
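A quick sketch of the L x 2 x H x D x P formula with illustrative Llama-3-70B-style shapes (80 layers, 8 GQA KV heads, head dim 128, fp16 — assumed here for the exercise; check your model's config):

```python
def kv_bytes_per_token(L: int, H: int, D: int, P: int = 2) -> int:
    """L layers x (K + V) x H KV heads x D head-dim x P bytes per value."""
    return L * 2 * H * D * P

# Illustrative 70B-class GQA shapes
per_token = kv_bytes_per_token(L=80, H=8, D=128, P=2)

# The quiz scenario: 128 concurrent requests x 2048 tokens each
batch_total_gib = 128 * 2048 * per_token / 2**30

print(per_token)        # 327680 bytes: ~320 KiB of KV cache per token
print(batch_total_gib)  # 80.0 GiB: already past a single 80 GB H100
```

This is why the 128 x 2048 batch in the question blows through HBM: the per-token cost looks tiny until it is multiplied by batch size and sequence length.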
Ch 23: Batching: static, dynamic, continuous (Orca)
- Q: Why does batching improve LLM serving throughput in a memory-bandwidth-bound decode regime?
  A: Weight bytes read from HBM are amortized across all requests in the batch, so cost per user drops near-linearly with batch size.
  Why: In memory-bound decode the bottleneck is reading model weights from HBM. A single forward pass reads the weights once regardless of batch size, so serving N users in one pass costs 1/N of the weight-read bandwidth per user.
- Q: What is the padding waste problem in static batching, and why does it reduce GPU utilization?
  A: Requests in a batch have different lengths and must be padded to the longest, so the GPU performs attention over padding tokens that contribute nothing to the output.
  Why: Every padded position requires computation (attention, feedforward) that produces no useful result. In a batch where the longest sequence is 1024 and the average is 256, roughly 75% of the compute is spent on padding.
- Q: Continuous (iteration-level) batching allows new requests to join an in-flight batch between decode steps. What problem does this solve that dynamic batching cannot?
  A: In dynamic batching, short requests finish early and their GPU slots sit idle waiting for the rest of the batch; continuous batching immediately fills those slots with new requests.
  Why: In static or dynamic batching the batch runs until the longest request finishes, so requests that complete early leave their GPU capacity idle. Continuous batching inserts new requests at any iteration boundary, keeping the batch full and GPU utilization high.
- Q: Mixing a new prefill request into an ongoing continuous decode batch causes a "prefill bubble." Why, and how do modern schedulers handle it?
  A: Prefill is compute-bound while decode is memory-bound; running both together halves the effective throughput of each. Schedulers use chunked prefill to spread the prefill across multiple decode iterations.
  Why: A large prefill inserted into the batch changes the workload from memory-bound decode to compute-bound prefill for one step, stalling ongoing decode requests. Chunked prefill (splitting the prefill across N iterations) keeps each iteration's compute profile stable.
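The padding-waste fact is easy to verify numerically. The lengths below are made up to match the quiz's "longest 1024, average 256" example:

```python
def padding_waste(lengths: list[int]) -> float:
    """Fraction of a static batch's token slots that hold padding.

    Every sequence is padded to the longest; useful work is the sum of
    real lengths, total work is longest * batch_size.
    """
    longest = max(lengths)
    total_slots = longest * len(lengths)
    useful_slots = sum(lengths)
    return 1 - useful_slots / total_slots

# Illustrative batch: max 1024, mean 256 (sums to 2048 over 8 requests)
batch = [1024, 256, 256, 128, 128, 128, 64, 64]
print(padding_waste(batch))  # 0.75: three quarters of the compute is padding
```

Continuous batching attacks exactly this number by refilling slots the moment a short request finishes, instead of padding everything to the straggler.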
Ch 24: PagedAttention and vLLM as a virtual-memory system for KV cache
- Q: Without PagedAttention, why does pre-allocating the maximum sequence length per request waste so much memory?
  A: The maximum sequence length must be allocated up front per request, but most requests are far shorter, leaving the majority of reserved memory unused.
  Why: A request with a 300-token completion still reserves space for max_seq_len tokens (e.g., 32768). At roughly 10 GB per max-length slot for a 70B model, even a few requests exhaust GPU memory despite using a fraction of it.
- Q: PagedAttention borrows the concept of paging directly from operating-system virtual memory. What is the mapping in the analogy?
  A: Physical GPU memory blocks map to OS physical pages; logical KV cache slots per request map to virtual addresses.
  Why: Just as virtual memory lets a process address a sparse logical address space backed by physical pages allocated on demand, PagedAttention lets each request address a logical KV sequence backed by physical GPU memory blocks allocated one block at a time.
- Q: How does PagedAttention's copy-on-write mechanism enable efficient prefix caching for requests that share a common prompt prefix?
  A: The block table for each request points to the same physical KV blocks for the shared prefix; new blocks are allocated only when a request diverges and writes new tokens.
  Why: Multiple requests can reference the same physical block via their block tables with a reference count. When a request needs to write beyond the shared prefix it gets a new private block — exactly the copy-on-write semantics from OS fork.
- Q: PagedAttention reports a 2-4x throughput improvement on real workloads over contiguous KV allocation. This improvement comes primarily from which effect?
  A: Reducing internal fragmentation allows more concurrent requests to fit in the same GPU memory, increasing the effective batch size and therefore amortizing weight reads over more users.
  Why: PagedAttention's primary win is packing more concurrent requests into the same HBM by eliminating wasted reserved-but-unused space. Higher concurrency means higher batch size, which directly improves memory-bound decode throughput.
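The block-table / copy-on-write idea can be sketched in a few lines. This is a toy model of the mechanism, not vLLM's actual allocator; block indices stand in for physical KV pages:

```python
class BlockPool:
    """Toy physical-block pool with reference counts (PagedAttention-style)."""

    def __init__(self, n_blocks: int):
        self.free = list(range(n_blocks))
        self.refcount = [0] * n_blocks

    def alloc(self) -> int:
        """Grab a fresh private block."""
        b = self.free.pop()
        self.refcount[b] = 1
        return b

    def share(self, b: int) -> int:
        """Reference an existing block instead of copying it."""
        self.refcount[b] += 1
        return b

pool = BlockPool(16)
prefix = [pool.alloc(), pool.alloc()]  # shared system prompt fills 2 blocks

# Two requests share the prefix blocks, then diverge into private blocks
req_a = [pool.share(b) for b in prefix] + [pool.alloc()]
req_b = [pool.share(b) for b in prefix] + [pool.alloc()]

assert req_a[:2] == req_b[:2]          # identical physical prefix blocks
assert req_a[2] != req_b[2]            # private blocks after divergence
assert pool.refcount[prefix[0]] == 3   # original owner + two sharers
```

Writes past the shared prefix go to the freshly allocated private block, which is the copy-on-write step; eviction would only reclaim a block once its refcount drops to zero.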
Ch 25: FlashAttention and the GPU memory hierarchy
- Q: Standard attention materializes the full (S x S) attention score matrix as an intermediate tensor. Why is this the dominant cost for long sequences?
  A: Reading and writing the S x S matrix to HBM requires O(S^2) bytes of memory traffic, which on a bandwidth-limited GPU dominates the FLOP cost.
  Why: For large S the attention matrix is huge (e.g., S = 8192 gives 8192^2 ≈ 67M fp16 values ≈ 134 MB per head, several GB across all heads). Even if FLOPs are manageable, reading and writing this tensor through HBM bandwidth-limits the operation. FlashAttention eliminates this read/write.
- Q: FlashAttention computes the same mathematical result as standard attention but without materializing the full attention matrix. What is the key algorithmic insight that makes this possible?
  A: The online softmax trick allows the exact softmax to be computed incrementally in tiles that each fit in SRAM, accumulating the weighted value sum without ever storing the full score matrix.
  Why: The online softmax algorithm (Milakov and Gimelshein, 2018) updates a running max and normalization factor tile by tile, allowing exact softmax without ever holding the complete S x S matrix. FlashAttention uses this to tile the computation entirely in SRAM.
- Q: FlashAttention achieves wall-clock speedup without reducing FLOPs. Why does reducing HBM reads and writes translate to faster wall-clock time on GPUs?
  A: Attention is in the HBM-bandwidth-bound regime, meaning the bottleneck is data movement, not arithmetic. Reducing HBM traffic directly reduces the operation's wall-clock time.
  Why: When a kernel is bandwidth-bound (as attention is for typical sequence lengths), the runtime is approximately bytes_transferred / bandwidth. Halving HBM traffic halves runtime even if FLOPs are unchanged.
- Q: FlashAttention-2 introduced work-partitioning improvements. Why does distributing work across warps within a thread block improve performance specifically for the attention computation?
  A: Naive warp assignment causes synchronization stalls in the reduction across the sequence dimension; better partitioning minimizes cross-warp synchronization and maximizes SRAM reuse.
  Why: In v1, warp-level reductions introduced synchronization overhead within each tile computation. FA2 reorganizes which dimensions each warp handles to minimize barriers and keep all warps busy, improving occupancy and reducing idle cycles.
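The online-softmax trick is worth seeing once in plain code. The sketch below computes one attention row two ways — materializing all scores, and tile-by-tile with only a running max, normalizer, and accumulator — and checks that they agree exactly (scalar values for simplicity; real kernels do this per tile of the V matrix):

```python
import math

def attention_row(scores: list[float], values: list[float]) -> float:
    """Reference: softmax(scores) . values, holding all scores at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / z

def online_attention_row(scores, values, tile: int = 2) -> float:
    """Online softmax (Milakov & Gimelshein): single pass over tiles,
    keeping only a running max m, normalizer z, and weighted-sum acc."""
    m, z, acc = float("-inf"), 0.0, 0.0
    for i in range(0, len(scores), tile):
        s_t, v_t = scores[i:i + tile], values[i:i + tile]
        m_new = max(m, max(s_t))
        # Rescale previous partial sums into the new max's frame
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        z = z * scale + sum(math.exp(s - m_new) for s in s_t)
        acc = acc * scale + sum(math.exp(s - m_new) * v
                                for s, v in zip(s_t, v_t))
        m = m_new
    return acc / z

scores = [0.1, 2.3, -1.0, 0.7, 1.5, -0.2]
values = [1.0, -2.0, 0.5, 3.0, 0.0, 1.2]
assert abs(attention_row(scores, values) -
           online_attention_row(scores, values)) < 1e-9
```

The rescaling step (`scale = exp(m - m_new)`) is the whole insight: earlier tiles' contributions stay correct when a later tile raises the running max, so the S x S matrix never needs to exist.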
Ch 26: Quantization: INT8, INT4, FP8, AWQ, GPTQ, SmoothQuant
- Q: A 70B model in INT4 has roughly 4x lower minimum decode latency than the same model in BF16. What is the direct mechanism for this speedup?
  A: Decode latency is bounded by weight bandwidth; INT4 weights are 4x smaller than BF16, so 4x fewer bytes are read from HBM per step.
  Why: Minimum decode latency is approximately model_size / HBM_bandwidth. Quantizing from 2 bytes per weight (BF16) to 0.5 bytes per weight (INT4) shrinks the model by 4x, which directly lowers the bandwidth-bound latency floor by 4x.
- Q: AWQ (Activation-aware Weight Quantization) produces better quality at INT4 than naive round-to-nearest quantization. What is its key insight?
  A: A small fraction of weight channels is salient for output quality; AWQ identifies these channels using activation magnitudes and protects them with per-channel scaling before quantization.
  Why: Not all weights matter equally. AWQ observes that channels with large activation scales amplify quantization error. Scaling those channels up before quantization (and compensating at inference) reduces their effective error without changing the weight bit-width.
- Q: SmoothQuant addresses the INT8 activation-outlier problem by migrating quantization difficulty from activations to weights. Concretely, what does it do?
  A: It divides each activation channel by a scaling factor s and multiplies the corresponding weight channel by s, shifting quantization error from the harder-to-quantize activations to the easier-to-quantize weights.
  Why: Activations have per-channel magnitude variation that makes uniform INT8 quantization lossy. SmoothQuant rescales each channel so the activation range shrinks and the weight range grows proportionally, making both easier to quantize to INT8 with low error.
- Q: KV cache quantization is applied separately from weight quantization. Why might INT4 KV cache quantization be riskier than INT4 weight quantization at the same bit-width?
  A: KV values are generated dynamically per request and have input-dependent outlier patterns that cannot be calibrated offline, making quantization error harder to bound.
  Why: Weight quantization errors can be measured and compensated during offline calibration on representative inputs. KV cache values depend on the specific prompt and generation context, so their dynamic range varies unpredictably, making it harder to choose safe scale factors.
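The 4x latency-floor claim falls straight out of the model_size / HBM_bandwidth formula. The H100 bandwidth figure (3.35 TB/s) comes from the Ch 30 facts; everything else is arithmetic:

```python
def min_decode_latency_ms(params_b: float, bytes_per_weight: float,
                          hbm_tb_s: float) -> float:
    """Bandwidth floor on per-token decode latency: every weight byte
    must be streamed from HBM once per decode step."""
    model_bytes = params_b * 1e9 * bytes_per_weight
    return model_bytes / (hbm_tb_s * 1e12) * 1e3

bf16 = min_decode_latency_ms(70, 2.0, 3.35)  # 70B in BF16 on an H100
int4 = min_decode_latency_ms(70, 0.5, 3.35)  # same model, INT4 weights

print(round(bf16, 1))        # ~41.8 ms/token floor in BF16
print(round(int4, 1))        # ~10.4 ms/token floor in INT4
print(round(bf16 / int4, 1)) # 4.0x: exactly the bytes-per-weight ratio
```

Note this is a floor, not a prediction: kernel overheads, attention, and batching all sit on top of it, but the quantization speedup ratio survives because both numerators shrink the same way.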
Ch 27: Speculative decoding: Medusa, EAGLE, MTP
- Q: What is the primary reason speculative decoding can yield multiple tokens per big-model forward pass without changing the output distribution?
  A: Rejected draft tokens are discarded and the target model resamples, which provably preserves the target distribution.
  Why: The acceptance-rejection scheme uses a corrected resample that yields exact draws from the target distribution. Draft tokens whose probability ratio passes the test are accepted; otherwise a corrected sample is drawn from the target.
- Q: Why does speculative decoding provide the largest speedup in the memory-bandwidth-bound decode regime rather than in prefill?
  A: Decode is memory-bandwidth-bound at AI near 1, so verifying K draft tokens costs roughly the same as verifying 1.
  Why: In memory-bound decode the GPU is stalled waiting for weights from HBM regardless of how many tokens it processes. Verifying K tokens in one forward pass costs nearly the same as verifying 1, so K correct guesses means K tokens for the price of one.
- Q: EAGLE differs from classic speculative decoding primarily because it drafts using which architectural approach?
  A: A lightweight autoregressive head that takes the target model's hidden states as input.
  Why: EAGLE attaches a lightweight autoregressive draft head that consumes the target model's feature vectors, enabling much higher acceptance rates because the draft is conditioned on the same representation the target model uses.
- Q: A team is deploying speculative decoding for a long-form document summarization workload where each request generates 4096 output tokens. Which factor most threatens a high acceptance rate?
  A: Over long generation sequences the draft model's distribution diverges further from the target model's.
  Why: Acceptance rate depends on how well the draft model predicts the target's next token. Over long generations, distribution drift accumulates, especially in creative or high-entropy regions, so the draft model's predictions become less aligned with the target.
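How much a given acceptance rate buys can be estimated with the standard expected-tokens formula, under the simplifying assumption (from the original speculative-decoding analysis) that each draft token is accepted independently with probability a:

```python
def expected_tokens_per_pass(a: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with K draft tokens:
    a run of accepted drafts plus one corrected/bonus sample.
    Closed form of sum over run lengths: (1 - a^(k+1)) / (1 - a)."""
    if a == 1.0:
        return k + 1
    return (1 - a ** (k + 1)) / (1 - a)

# Illustrative numbers: 4 draft tokens, 80% per-token acceptance
print(round(expected_tokens_per_pass(0.8, 4), 4))  # 3.3616 tokens per pass
```

Since a verification pass costs roughly one decode step in the memory-bound regime, this expectation is approximately the speedup; it also shows why acceptance-rate drift over a 4096-token generation eats the gain quickly (at a = 0.5 the same K yields only ~1.94 tokens per pass).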
Ch 28: Tensor, pipeline, expert, and sequence parallelism for inference
- Q: Why is tensor parallelism (TP) strongly preferred within a single node for inference, whereas pipeline parallelism (PP) is preferred across nodes?
  A: TP requires NVLink-class bandwidth for the all-reduces after each transformer block, which only exists within a node.
  Why: TP's all-reduce after every attention and FFN block generates a synchronization per layer that would saturate a slow inter-node link. NVLink within a node handles this cheaply; slower cross-node fabrics (InfiniBand, Ethernet) make it prohibitive.
- Q: In the context of inference, replica-based data parallelism is used for which goal, and why is it simpler than intra-model parallelism?
  A: It increases throughput by routing requests to independent model copies; it is simpler because replicas share no state.
  Why: Replica-based data parallelism scales throughput by routing extra requests to idle copies of the full model. Each replica is independent, so no inter-GPU synchronization is needed during inference, unlike TP or PP.
- Q: For a 70B dense model that does not fit on a single 80 GB H100, which combination of parallelism strategies is standard production practice, and why?
  A: TP=2 across an NVLink-connected GPU pair, adding PP across nodes only if the model spans nodes, because TP reduces latency while PP tolerates slower inter-node links.
  Why: TP=2 within a node exploits fast NVLink to roughly halve per-request latency. When the model is too large for one node, PP is added across nodes because it passes activations only at pipeline boundaries, tolerating higher inter-node latency.
- Q: Sequence parallelism (SP) at inference time is most beneficial in which specific scenario?
  A: Long-context prefill that does not fit in a single GPU's HBM even with TP.
  Why: SP shards the sequence dimension across GPUs, distributing the O(S^2) attention memory and compute. This is only meaningful when the sequence itself is so long that even after TP splitting, the activation tensors overflow HBM during prefill.
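The TP-inside / PP-across rule can be motivated with a rough per-token communication count. All numbers here are back-of-envelope assumptions (8192 hidden size, 80 layers, fp16 activations, ring all-reduce moving ~2(p-1)/p of the tensor per GPU, two all-reduces per transformer layer), not figures from the chapter:

```python
def tp_allreduce_bytes_per_token_layer(hidden: int, tp: int,
                                       bytes_per_act: int = 2) -> float:
    """Per-token traffic for one layer's two all-reduces under TP.
    Ring all-reduce moves about 2*(tp-1)/tp of the tensor per GPU."""
    per_allreduce = 2 * (tp - 1) / tp * hidden * bytes_per_act
    return 2 * per_allreduce  # one after attention, one after the FFN

def pp_boundary_bytes_per_token(hidden: int, bytes_per_act: int = 2) -> int:
    """PP ships one activation vector per token, once per stage boundary."""
    return hidden * bytes_per_act

h, layers = 8192, 80
tp_traffic = tp_allreduce_bytes_per_token_layer(h, tp=2) * layers
pp_traffic = pp_boundary_bytes_per_token(h)  # single stage boundary

print(round(tp_traffic / pp_traffic))  # ~160x more traffic for TP
```

TP synchronizes every layer; PP synchronizes once per stage boundary. That two-orders-of-magnitude gap is why TP needs NVLink and PP survives on inter-node links.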
Ch 29: Prefix caching, prompt caching, radix attention
- Q: What is the primary data structure that RadixAttention (SGLang) uses to allow many requests with partially overlapping prefixes to share cached KV entries?
  A: A radix tree where each node represents a token-sequence segment and child nodes extend that prefix.
  Why: SGLang's RadixAttention organizes cached KV blocks as a radix tree. Requests that share a common prefix share the path from the root to the divergence point, while their unique suffixes branch off as separate child nodes, maximizing reuse.
- Q: Which eviction policy is most appropriate for a prefix cache that serves a chat application with a fixed system prompt shared by all users?
  A: LRU with a pinning mechanism, so recently and frequently used prefixes (like the system prompt) are protected from eviction.
  Why: LRU naturally protects a hot system prompt because every request touches it, so it always appears recently used. Pinning or high LRU priority ensures that the most-shared prefixes survive cache pressure from unique per-user suffixes.
- Q: Prefix caching reduces prefill TTFT only on cache hits. Which workload pattern produces the highest cache hit rate?
  A: High-volume chat with a long fixed system prompt and short, diverse user turns.
  Why: A long fixed system prompt is identical across every request, guaranteeing a hit for those tokens on every request after the first warm-up. Short unique user turns add only a small suffix that must still be prefilled, but the large shared portion is free.
- Q: A serving system uses prefix caching with PagedAttention. A new request partially matches a cached prefix, but the last cached block contains tokens that differ from the new request. What does the system do?
  A: It reuses all fully matching blocks and recomputes only the partial or non-matching tail blocks, using copy-on-write for diverging pages.
  Why: PagedAttention's block granularity means fully matching blocks are reused directly. The first diverging block triggers a copy-on-write (or is simply recomputed), ensuring correctness while maximizing the reused portion.
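Block-granular prefix matching is simple enough to sketch directly. This toy function (a 16-token block size is assumed for illustration) shows why reuse stops at the first block containing any mismatch:

```python
def reusable_blocks(cached: list[int], new: list[int], block: int = 16) -> int:
    """Count full KV blocks reusable from a cached prefix.

    Reuse proceeds block by block and stops at the first block that is
    incomplete or contains a token mismatch; that block must be
    recomputed (or copy-on-written).
    """
    n = 0
    while True:
        lo, hi = n * block, (n + 1) * block
        if hi > len(cached) or hi > len(new):
            break  # no further complete block on both sides
        if cached[lo:hi] != new[lo:hi]:
            break  # divergence inside this block
        n += 1
    return n

cached = list(range(100))
new = list(range(40)) + [999] + list(range(41, 100))  # diverges at token 40

print(reusable_blocks(cached, new))  # 2: tokens 0..31 reused, block 2 redone
```

Note the granularity cost: the divergence at token 40 forfeits the 8 matching tokens (32-39) that share its block, which is the trade-off of block-level bookkeeping.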
Ch 30: Cost modeling for inference: tokens, GPUs, dollars
- Q: A 70B model runs on one H100 at $3/hour and sustains 1000 tokens/sec total throughput. What is the cost per million output tokens?
  A: $0.83.
  Why: 1000 tokens/sec x 3600 sec/hr = 3.6M tokens/hr. $3/hr divided by 3.6M tokens/hr ≈ $0.83 per million tokens. This is the fundamental capacity-planning formula.
- Q: Why does self-hosting a large model only beat API pricing at high GPU utilization rates?
  A: Reserved GPU costs accrue over all hours whether the GPU is busy or idle, so low utilization raises the effective per-token cost dramatically.
  Why: With reserved or owned hardware you pay for GPU-hours regardless of load. At 10% utilization you are paying 10x the per-token cost versus 100% utilization. The API charges only for tokens consumed, so it wins below the break-even utilization.
- Q: In the memory-bound decode regime, which hardware parameter most directly determines maximum tokens per second for a fixed model?
  A: HBM bandwidth, because the bottleneck is loading model weights each decode step.
  Why: Decode is memory-bandwidth-bound at arithmetic intensity near 1 FLOP/byte. The limiting factor is how fast weights can be streamed from HBM to the compute units each step, not the number of FLOPs the GPU can perform.
- Q: An H200 has 4.8 TB/s HBM bandwidth versus the H100's 3.35 TB/s. Ignoring all other factors, by approximately what factor does the H200 improve maximum decode throughput for the same model?
  A: 1.43x, because throughput scales linearly with HBM bandwidth in the memory-bound regime.
  Why: In the memory-bandwidth-bound regime, tokens per second scales linearly with HBM bandwidth: 4.8 / 3.35 ≈ 1.43, so the H200 produces roughly 43% more decode throughput for the same model size.
Ch 31: Latency budgets, tail latency, and the p99 problem
- Q: A production chat service has good mean TTFT but very high p99 TTFT. Which root cause most likely explains this pattern?
  A: Occasional long-prompt requests joining the batch cause short requests to queue behind expensive prefills.
  Why: Long prefill requests are compute-bound and block the scheduler. Short requests that arrive while a long prefill is in flight must wait in queue, producing a high TTFT tail even though the mean (dominated by the common short case) looks fine.
- Q: Little's Law states L = lambda x W. In the context of LLM serving, what do L, lambda, and W represent?
  A: L = average number of requests in the system, lambda = request arrival rate, W = average time a request spends in the system.
  Why: Little's Law is a queueing-theory identity: average occupancy equals arrival rate times average sojourn time. For an LLM serving system it lets you reason about how increasing load (lambda) inflates end-to-end latency (W) as the system fills up.
- Q: Why does end-to-end latency variance in LLM serving grow much wider than in a typical web service under similar load?
  A: LLM output length is unpredictable at admission time, so requests that generate many tokens consume far more resources than short ones, creating high-variance batch composition.
  Why: Unlike a web service where request cost is roughly fixed, LLM output length is unknown until the EOS token is generated. A request generating 2000 tokens occupies batch slots 40x longer than one generating 50 tokens, creating enormous variance in resource consumption.
- Q: A team sets their TTFT SLO at p95 < 1 s and their TPOT SLO at p99 < 50 ms. A new workload introduces 10% of requests with 8k-token prompts. Which SLO is most at risk, and why?
  A: TTFT p95, because the 8k-token prefills are expensive and will push the p95 of TTFT up directly.
  Why: 8k-token prefills are expensive compute-bound operations. At 10% of traffic, they appear frequently enough to impact p95 TTFT directly and to queue shorter requests behind them. TPOT is affected less directly, since decode cost depends on each decoding request's own KV cache length, not on another request's prefill.
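Little's Law is most useful for capacity sanity checks: given an arrival rate and an observed end-to-end time, it tells you how many requests must fit concurrently. Illustrative numbers below, not from the chapter:

```python
def littles_law_L(arrival_rate_per_s: float, avg_time_in_system_s: float) -> float:
    """L = lambda * W: average number of requests resident in the system."""
    return arrival_rate_per_s * avg_time_in_system_s

# 20 req/s arriving, each spending 6 s end to end (queue + prefill + decode)
concurrent = littles_law_L(20, 6)
print(concurrent)  # 120 requests in flight on average
```

If the serving stack can only hold, say, 64 concurrent requests' KV caches, this identity says W must rise (queueing) until the equation balances — which is exactly how load inflates tail latency.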
Ch 32: Multimodal: vision-language, audio, the tokenizer trick
- Q: A 336x336 image is processed by a ViT with 14x14 patches. How many image tokens does this produce, and why does this dramatically increase TTFT?
  A: 576 tokens, because prefill must process all image tokens along with the text prompt in one compute-bound pass.
  Why: (336/14)^2 = 576 patch tokens. These are concatenated with the text prompt and processed together in prefill. A typical 50-token text prompt becomes a 626-token sequence, making prefill more than 10x more expensive for the same text query.
- Q: What is the key architectural difference between early fusion and cross-attention fusion in vision-language models?
  A: Early fusion concatenates image tokens with text tokens as one flat sequence for the LLM; cross-attention fusion adds dedicated attention layers where text queries attend to image features separately.
  Why: Early fusion (used in LLaVA, Qwen-VL, etc.) injects image tokens directly into the token sequence. Cross-attention fusion (used in Flamingo, etc.) keeps image representations in a separate stream and adds special cross-attention layers, which can be more parameter-efficient but is architecturally more complex.
- Q: Why does disaggregated prefill-decode (Chapter 36) provide a larger benefit for vision-language workloads than for text-only workloads?
  A: Image tokens inflate prefill length dramatically, widening the gap between compute-bound prefill and memory-bound decode and making separation more valuable.
  Why: Each image in a VL request adds hundreds of tokens to the prefill phase. This makes prefill much more compute-intensive relative to decode, sharpening the asymmetry that disaggregation exploits. The benefit scales with how much heavier prefill is than decode.
- Q: A model is asked to process a 30-second video at 1 FPS (30 frames), each frame producing 256 tokens. What serving challenge does this create that a text-only workload does not?
  A: The 7680 tokens of video content force the KV cache to a size that may evict active text sessions even with PagedAttention.
  Why: 30 frames x 256 tokens = 7680 image tokens per request. At this scale the per-request KV cache demand is enormous and competes directly with memory needed for other concurrent requests, creating severe memory pressure that text workloads avoid.
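The token counts in both scenarios are worth being able to reproduce quickly (square images and non-overlapping patches assumed, as in a standard ViT):

```python
def vit_image_tokens(image_px: int, patch_px: int) -> int:
    """Patch tokens for a square image with square non-overlapping patches."""
    assert image_px % patch_px == 0, "image must tile evenly into patches"
    return (image_px // patch_px) ** 2

image_tokens = vit_image_tokens(336, 14)
video_tokens = 30 * 256  # 30 frames at 256 tokens each

print(image_tokens)              # 576 tokens for one 336x336 image
print(50 + image_tokens)         # 626-token prefill for a 50-token prompt
print(video_tokens)              # 7680 tokens for the video example
```

Multiply the 7680 video tokens by a per-token KV cost in the hundreds-of-KB range (Ch 22) and the per-request cache lands in the gigabytes, which is the memory-pressure problem the quiz describes.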
Ch 33: The attention compression family: MHA, MQA, GQA, MLA
- Q: Llama 3 70B uses GQA with 8 KV heads versus 64 query heads. By what factor does this reduce per-token KV cache size compared to full MHA?
  A: 8x reduction.
  Why: GQA reduces KV heads from 64 to 8, an 8x reduction. Since per-token KV cache size is proportional to n_kv_heads, this directly cuts the cache size by 8x, allowing 8x more concurrent users or 8x longer contexts at the same memory cost.
- Q: Multi-Query Attention (MQA) uses a single KV head shared by all query heads. What is the main quality downside compared to GQA?
  A: Sharing one KV head across all queries reduces the model's ability to represent diverse attention patterns, leading to quality degradation, especially on complex tasks.
  Why: MQA is an extreme compression that forces all query heads to attend to the same key-value representation. Different heads can no longer specialize to different aspects of the context, which reduces model expressiveness and hurts quality on tasks requiring diverse attention.
- Q: DeepSeek's MLA compresses KV heads by projecting K and V into a low-rank latent space. What does this allow that GQA does not?
  A: MLA stores only the compressed latent vector per token in the KV cache rather than the full K and V tensors, reducing cache size below even GQA with few heads.
  Why: MLA stores a single low-rank latent vector per token and reconstructs K and V at attention time. The latent dimension can be much smaller than the full KV dimension, so the per-token cache footprint is proportional to the latent rank rather than n_kv_heads x head_dim.
- Q: A model was trained with MHA and you want to continue pretraining it with GQA to reduce serving costs. What is the standard approach for initializing the GQA KV heads from the MHA checkpoint?
  A: Assign each GQA KV head the mean-pooled average of the MHA KV heads in its group, preserving average representation quality.
  Why: The standard GQA uptraining recipe (Ainslie et al., 2023) initializes each grouped KV head by mean-pooling the original MHA KV heads in the corresponding group. This preserves signal from all original heads and provides a warm start for continued training.
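The mean-pool initialization is simple to sketch. Pure-Python lists stand in here for the real per-head weight tensors; the 64-to-8 head counts match the Llama 3 70B example above:

```python
def mean_pool_kv_heads(mha_heads: list[list[float]],
                       n_kv_heads: int) -> list[list[float]]:
    """GQA uptraining init (Ainslie et al., 2023, sketched):
    each grouped KV head starts as the elementwise mean of the
    consecutive MHA KV heads assigned to its group."""
    group = len(mha_heads) // n_kv_heads
    pooled = []
    for g in range(n_kv_heads):
        members = mha_heads[g * group:(g + 1) * group]
        pooled.append([sum(col) / group for col in zip(*members)])
    return pooled

# 64 toy MHA heads, each a 2-dim "weight vector" equal to its head index
heads = [[float(i)] * 2 for i in range(64)]
kv = mean_pool_kv_heads(heads, n_kv_heads=8)

print(len(kv))  # 8 KV heads: the 8x cache reduction from the first fact
print(kv[0])    # [3.5, 3.5]: mean of heads 0..7 in the first group
```

Real checkpoint surgery operates on the K and V projection matrices per head rather than toy vectors, but the grouping and averaging logic is the same.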
Ch 34: Mixture of Experts: routing, balancing, and the inference cost story
- Q: A MoE model has 671B total parameters with top-2 routing over 128 experts, giving 37B active parameters per token. Why is the memory-bandwidth cost per token higher than for a 37B dense model despite identical active compute?
  A: All 671B parameters must reside in HBM across GPUs, and at low per-expert batch sizes the weight-read bandwidth is amortized over fewer tokens per expert than in a dense model of the same active size.
  Why: All expert weights must be loaded into GPU memory even though each token uses only 2 experts. When the batch size per expert is small (low load), the memory-bandwidth-per-token ratio is much worse than for a dense model with the same active parameter count.
- Q: What is load imbalance in MoE routing, and why does it degrade throughput?
  A: Load imbalance occurs when the router sends a disproportionate share of tokens to a few popular experts, overloading their token buffers while other experts sit idle.
  Why: If the router consistently prefers a few popular experts, those experts become bottlenecks while others sit idle. In expert parallelism each GPU handles a subset of experts, so one overloaded expert GPU stalls the entire batch, degrading throughput.
- Q: Why does a MoE model generally require expert parallelism (EP) in addition to tensor parallelism for large-scale inference?
  A: The full set of expert weights is too large to replicate on every GPU, but each expert fits on one GPU, so routing tokens to the GPU holding the assigned expert is more efficient than replicating all experts everywhere.
  Why: With many large experts, storing all experts on every GPU would require enormous memory. EP assigns different experts to different GPUs and routes tokens across GPUs via all-to-all communication. TP is still used within dense layers (attention), while the FFN experts are handled by EP.
- Q: DeepSeek-V3 uses fine-grained MoE with 256 experts and top-8 routing rather than coarse MoE with 8 experts and top-2. What advantage does fine-grained routing provide, and what challenge does it introduce?
  A: Fine-grained MoE lets each token combine more diverse expert specializations for higher model quality, but the larger expert count makes load balancing and all-to-all routing more complex at scale.
  Why: More experts with higher top-k allows each token to aggregate knowledge from a wider set of specialists, improving model quality. The challenge is that more experts spread across more GPUs increases all-to-all communication complexity and requires careful load balancing to avoid stragglers.
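The bandwidth penalty of the first fact can be sketched under a deliberately pessimistic assumption: the batch is small enough (and routing spread enough) that essentially all expert weights get touched each step, so the full parameter set is streamed either way:

```python
def weight_bytes_per_token(total_params_b: float, tokens_sharing_read: int,
                           bytes_per_weight: float = 1.0) -> float:
    """HBM weight traffic attributable to one token when a batch of
    `tokens_sharing_read` tokens shares one sweep over the weights.
    Worst case for MoE: every expert is touched, so total params count."""
    return total_params_b * 1e9 * bytes_per_weight / tokens_sharing_read

batch = 8  # small batch: few tokens amortizing each weight read
moe = weight_bytes_per_token(671, batch)    # 671B total MoE
dense = weight_bytes_per_token(37, batch)   # 37B dense, same active compute

print(round(moe / dense, 1))  # ~18.1x more weight traffic per token
```

At large batch sizes the gap narrows, because each popular expert serves many tokens per sweep; the quiz's point is precisely that the MoE bandwidth story depends on per-expert batch size, not on active parameter count.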
Ch 35: Long context: RoPE, YaRN, position interpolation, ring attention, sparse attention
- Why does position interpolation (PI) enable context extension without full retraining, and what is its key limitation?PI remaps positions by scaling each position index so that the original training range covers the new longer context, but compressed positions reduce the model's ability to distinguish nearby tokensPI compresses positions from the extended range into the training range by dividing position indices by the scale factor. The model has seen all resulting positions during training, so it can handle them. However, compression makes nearby positions closer together in angle, reducing the model's ability to differentiate them.
- YaRN improves over naive NTK-aware scaling primarily by doing what?Interpolating only the high-frequency RoPE dimensions (short-range) while leaving low-frequency dimensions (long-range) unscaled, preserving both local and global position resolutionYaRN's insight is that different RoPE frequency bands serve different roles. High-frequency dimensions track local token order and need interpolation at long context; low-frequency dimensions track long-range structure and should be left alone. This mixed strategy preserves quality better than applying uniform scaling.
- Ring attention distributes long-context attention across multiple GPUs by partitioning which dimension?The sequence dimension, with each GPU holding a shard of the query, key, and value sequence while keys and values are passed in a ring to enable full attentionRing attention assigns each GPU a contiguous shard of the sequence. Keys and values are rotated around the ring of GPUs in a pipeline, so each GPU computes partial attention scores for its query shard against all key-value shards. The sequence sharding makes memory linear in the number of GPUs.
- A model is advertised as supporting 1M token context. Which statement is most accurate about its practical utility at 900k tokens?Performance at 900k tokens is likely degraded because training data rarely contains examples near the maximum length and because attention quality drops in the out-of-distribution position rangeEven with context extension techniques, long-context fine-tuning exposes the model to far fewer examples near the maximum length, so quality degrades near the edge of the window (distinct from the 'lost in the middle' effect, where retrieval degrades for information placed mid-context). Advertised context length is an upper bound, not a quality guarantee.
Ch 36: Disaggregated prefill/decode: production reality with workload-dependent payoff
- What is the core inefficiency that disaggregated prefill-decode serving addresses?Co-located prefill and decode compete for the same GPU's compute and memory bandwidth even though they have fundamentally different bottlenecksPrefill is compute-bound at AI ~700 while decode is memory-bandwidth-bound at AI ~1. Running them on the same GPU means each phase competes for resources it does not need, reducing utilization of both. Disaggregation lets each phase run on hardware optimized for its bottleneck.
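The AI figures quoted above can be derived from first principles. A sketch assuming fp16 weights and ignoring activation traffic (`arithmetic_intensity` is an illustrative helper):

```python
def arithmetic_intensity(tokens, bytes_per_param=2):
    """FLOPs per byte for one weight matrix applied to a batch of tokens.

    A matmul against an N-parameter weight does ~2*N*tokens FLOPs while reading
    ~N*bytes_per_param bytes, so AI ~ 2*tokens / bytes_per_param, independent of N.
    """
    return 2 * tokens / bytes_per_param

assert arithmetic_intensity(1) == 1.0       # single-token decode: AI ~1
assert arithmetic_intensity(700) == 700.0   # a 700-token prefill: AI ~700
# An H100-class GPU crosses from memory- to compute-bound at roughly
# peak_flops / peak_bandwidth ~ 1e15 / 3.35e12 ~ 300 FLOPs per byte,
# so decode sits far below the crossover and prefill far above it.
```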
- After prefill completes on the prefill GPU pool, what must be transferred to the decode GPU pool, and why is this the main operational cost of disaggregation?The KV cache for the prompt tokens, because decode needs those key-value pairs to attend over during generationDecode must attend over all previously computed tokens. The KV cache built during prefill encodes those tokens and must transfer from the prefill GPU to the decode GPU. At hundreds of kilobytes per token for large models, this transfer is non-trivial and must be overlapped with compute or absorbed by high-bandwidth interconnects.
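The per-token cost is easy to estimate from model shape. The 80-layer GQA shape below is a Llama-3-70B-like assumption used for illustration, not a quoted spec:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_el=2):
    """KV cache footprint of one token: a K and a V vector in every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_el

# Assumed shape: 80 layers, 8 GQA KV heads, head_dim 128, fp16 cache:
per_tok = kv_bytes_per_token(80, 8, 128)
assert per_tok == 327_680        # 320 KiB per token
# A 4096-token prompt therefore means ~1.3 GB of KV cache to ship from
# the prefill pool to the decode pool for a single request.
```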
- For which workload type does disaggregated serving provide the largest throughput benefit?Vision-language or long-document workloads with very long prefill phases and moderate output lengthsThe benefit grows with how much prefill dominates over decode. VL workloads add hundreds of image tokens per request, making prefill extremely expensive relative to decode. Disaggregation lets prefill GPUs be heavily utilized for compute while decode GPUs are optimized for memory bandwidth, yielding large throughput improvements.
- A team deploys disaggregated serving with KEDA autoscaling. During a traffic spike, the prefill pool scales up before the decode pool. What immediate symptom appears?TTFT improves while p99 TPOT degrades because decode GPUs become the bottleneck for the higher volume of prefilled requestsMore prefill capacity means more requests complete prefill and are ready for decode simultaneously. If the decode pool has not scaled up yet, these requests queue at the decode tier, inflating TPOT and end-to-end latency even though TTFT per request improves.
Ch 37: KV cache compression and offload: LMCache, RDMA, NVMe tiering
- When should you prefer fetching KV cache from CPU DRAM rather than recomputing it from scratch?When the DRAM fetch time is less than the prefill recompute time for those tokens, which depends on sequence length and available computeRecompute cost scales with prefill compute while fetch cost scales with cache size and DRAM bandwidth. For long sequences on underpowered hardware, fetching is faster. For short sequences or compute-rich hardware, recomputing avoids the DRAM round-trip entirely.
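The decision reduces to comparing two times. A first-order break-even model with illustrative numbers (it ignores transfer/compute overlap, and real prefill cost grows superlinearly with length because of attention, which further favors fetching for long sequences):

```python
def prefer_fetch(n_tokens, kv_bytes_per_token, dram_gbps, prefill_tok_per_s):
    """True when loading cached KV from CPU DRAM beats recomputing prefill."""
    fetch_s = n_tokens * kv_bytes_per_token / (dram_gbps * 1e9)
    recompute_s = n_tokens / prefill_tok_per_s
    return fetch_s < recompute_s

# Assumed numbers: 320 KB/token of KV, 50 GB/s effective DRAM path,
# 10k tok/s prefill throughput → ~6.5 us/token to fetch vs 100 us/token
# to recompute, so fetching wins comfortably.
assert prefer_fetch(8_000, 327_680, 50, 10_000)
```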
- What is the primary motivation for RDMA-based remote KV cache sharing across replicas?To allow a request that routes to one replica to reuse the prefix KV cache computed by a different replica, avoiding redundant prefill across the fleetIn a multi-replica deployment, two replicas may independently compute the KV cache for the same popular system prompt. RDMA lets one replica read the already-computed cache from another's CPU DRAM or HBM over a fast network, avoiding duplicate prefill work across the fleet.
- NVMe offload is most useful for which KV cache pattern?Long-lived session caches such as multi-turn conversations that are inactive between turns but too large to keep in HBMNVMe has 100-microsecond latency and gigabytes-per-second bandwidth, making it suitable for KV caches that are accessed infrequently but need to persist. Multi-turn conversations that sit idle between user messages are ideal: the cache is too large for HBM but needs to survive until the next turn.
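The access-pattern reasoning above amounts to a placement policy over the memory tiers. A toy sketch with made-up thresholds, purely to make the pattern concrete:

```python
def pick_tier(idle_seconds, reuse_expected):
    """Toy KV cache placement policy across HBM / DRAM / NVMe tiers."""
    if not reuse_expected:
        return "evict"        # no future turn expected: free the memory
    if idle_seconds < 1:
        return "HBM"          # hot: the request is actively decoding
    if idle_seconds < 60:
        return "CPU DRAM"     # warm: next turn likely within moments
    return "NVMe"             # cold multi-turn session: persist cheaply
```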
- LMCache uses a semantic hash for KV cache lookup rather than a token-sequence hash. What does this enable that a pure token hash cannot?Cache hits across requests that have the same meaning expressed with different tokens, such as paraphrased system promptsA semantic hash maps similar content to the same bucket, allowing cache reuse even when the exact token sequence differs. This handles cases like system prompts rewritten with synonyms or slight reformatting that would miss a token-exact hash but produce nearly identical KV representations.
Ch 38: Hardware-aware kernel design: CUDA, CUTLASS, Triton, TVM
- FlashAttention is faster than a naive attention implementation primarily because it avoids which bottleneck?It avoids repeatedly writing and reading the large intermediate attention score matrix to HBM by fusing the softmax and matmul into one tiled kernelThe naive implementation writes the full S = QK^T matrix to HBM, reads it back for softmax, then writes again. For long sequences this is terabytes of HBM traffic. FlashAttention tiles the computation so the intermediate matrix is kept in fast SRAM, dramatically reducing HBM round-trips.
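A back-of-envelope on the score-matrix traffic the naive kernel pays, for an illustrative 32k-sequence, 32-head, 60-layer shape (not any specific model):

```python
# HBM traffic for the intermediate attention matrices in a naive kernel.
seq, heads, bytes_fp16 = 32_768, 32, 2
score_bytes = seq * seq * heads * bytes_fp16   # one S = QK^T per head
# Naive kernel: write S, read S for softmax, write P, read P for P @ V.
passes = 4
per_layer = passes * score_bytes               # ~0.27 TB for this shape
total = per_layer * 60                         # ~16 TB across 60 layers
assert per_layer == 274_877_906_944
```

FlashAttention keeps those intermediates in SRAM tiles, so this entire term disappears from HBM traffic, which is where the speedup comes from.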
- Triton's main advantage over raw CUDA for writing custom LLM kernels is best described as which?Triton abstracts over thread blocks and shared memory in Python, allowing domain experts to write high-performance kernels without managing low-level CUDA thread indexingTriton exposes a tile-based abstraction in Python where the programmer thinks in blocks of data rather than individual threads. This hides warp divergence, shared memory bank conflicts, and register allocation while still generating near-optimal PTX, making kernel authorship accessible to ML engineers.
- torch.compile with a custom CUDA kernel inserted as a custom op is used in production vLLM for paged attention. What does torch.compile add that the custom kernel alone does not provide?It enables operator fusion across the custom kernel and surrounding PyTorch ops, reducing kernel launch overhead and HBM traffic for adjacent operationstorch.compile traces the computation graph and can fuse adjacent elementwise operations with the custom kernel, eliminating intermediate tensor writes to HBM between ops. The custom kernel alone executes in isolation; torch.compile integrates it into the broader operator fusion pipeline.
- CUTLASS is described as a template library rather than a compiler or a DSL. What is the practical implication of this design for an ML systems engineer writing a new attention variant?Using CUTLASS requires deep knowledge of CUDA thread block geometry and memory hierarchy because you compose hand-written C++ templates that map directly to GPU micro-architecture conceptsCUTLASS templates are highly parameterized C++ components that correspond to concepts like warp tiles, shared memory layouts, and pipeline stages. Getting correct and fast CUTLASS code requires understanding how the GPU hardware executes matmuls, making it powerful but demanding compared to Triton.