Long context: RoPE, YaRN, position interpolation, ring attention, sparse attention
"A model that supports 1M tokens of context is not the same as a model that's actually good at 1M tokens of context"
We touched on long context in Chapters 7 (RoPE) and 10 (advertised vs effective context). This chapter goes deep on the techniques that make long context work — and the techniques that make it appear to work without actually working. By the end you’ll know:
- Why position encodings limit context length and what RoPE actually does.
- The family of context extension techniques (PI, NTK, YaRN) and how each works.
- Why “1M context” is rarely 1M effective.
- How ring attention distributes attention across GPUs.
- How sparse attention variants try to break the quadratic.
- What works in production and what’s still research.
Outline:
- The position encoding constraint.
- RoPE in detail.
- The naive context extension and why it fails.
- Position interpolation (PI).
- NTK-aware scaling.
- YaRN.
- Attention bandwidth and the long-context bottleneck.
- Ring attention.
- Sparse attention variants.
- The state of the art in long context.
35.1 The position encoding constraint
Recall from Chapter 6: attention is permutation-equivariant. Without something to break the symmetry, the model can’t distinguish “the cat sat on the mat” from “the mat sat on the cat.” Position encodings inject ordering information into the input.
The challenge with long context is that position encodings learned at one context length don’t transfer to a longer context length. A model trained with positions 0..4095 has never seen position 4096. Without some way to handle out-of-distribution positions, the model produces garbage at positions beyond its training length.
Specifically, the failure mode depends on the position encoding:
- Absolute learned positions (BERT, GPT-2): the model has a learned embedding for each position 0..max_train_pos. Position 4096+ doesn’t have an embedding at all. The model crashes (or has to fall back to a random embedding).
- Sinusoidal positions (original Transformer): the encoding is defined for any position, so the model can technically run on positions beyond training. But the model has never seen those high-frequency phases during training, so the resulting attention is noisy.
- RoPE: similar to sinusoidal — defined for any position, but never seen during training at the longer range. RoPE is more graceful in degradation than absolute embeddings, but the quality still drops sharply past the training length.
The general principle: a model is only good at the context lengths it was trained on. Extending context beyond training requires either retraining, fine-tuning, or clever inference-time tricks.
35.2 RoPE in detail
We introduced RoPE (Rotary Position Embedding) in Chapter 7. Let me go deeper.
The RoPE construction:
- Take a token’s query vector q of dimension d_h.
- Split it into pairs: (q_0, q_1), (q_2, q_3), ..., (q_{d_h-2}, q_{d_h-1}).
- For each pair (q_{2i}, q_{2i+1}), compute a rotation angle that depends on the position pos and the pair index i:
θ_i = pos × base^(-2i / d_h)
where base = 10000 is a hyperparameter (the original choice from sinusoidal encodings).
- Rotate the pair by angle θ_i:
new_q_{2i} = q_{2i} × cos(θ_i) - q_{2i+1} × sin(θ_i)
new_q_{2i+1} = q_{2i} × sin(θ_i) + q_{2i+1} × cos(θ_i)
The same rotation is applied to the keys. The values are not rotated.
The key property: when you compute a dot product between a rotated query at position m and a rotated key at position n, the result depends only on the relative position m - n:
(R(m θ) q) · (R(n θ) k) = q · R((n-m) θ) k
This is the magic. Even though RoPE rotates queries and keys based on absolute position, the resulting attention scores depend only on the relative position. The model effectively has relative position information even though it’s computed from absolute positions.
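A minimal NumPy sketch makes this concrete: apply the rotation formulas above to a query and a key, and check that the score depends only on the offset between their positions (the dimensions and positions here are illustrative):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply the RoPE rotation to a vector x at integer position pos.

    x has even dimension d_h; each pair (x_{2i}, x_{2i+1}) is rotated
    by theta_i = pos * base**(-2i / d_h).
    """
    d_h = x.shape[-1]
    i = np.arange(d_h // 2)
    theta = pos * base ** (-2.0 * i / d_h)
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
d_h = 64
q, k = rng.normal(size=d_h), rng.normal(size=d_h)

# Score for (query at m, key at n) depends only on m - n:
s1 = rope_rotate(q, 100) @ rope_rotate(k, 90)     # offset 10
s2 = rope_rotate(q, 1010) @ rope_rotate(k, 1000)  # offset 10, shifted by 910
assert np.allclose(s1, s2)
```

Shifting both positions by the same amount leaves every attention score unchanged, which is exactly the translation invariance the identity promises.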
The hyperparameter base = 10000 controls the frequency schedule. Lower-index pairs get higher frequencies (rotate fast as position increases); higher-index pairs get lower frequencies (rotate slowly). This is the same idea as Fourier basis functions — the model has access to multiple “wavelengths” of position information.
The frequencies are chosen such that the slowest rotation (at the highest index pair) takes a full revolution over a context length of about 2π × base^((d_h-2)/d_h) ≈ 2π × 10000 ≈ 60,000 tokens. This is the practical “natural” context length of RoPE with the default base — beyond that, even the slowest rotation has wrapped around, and position information becomes ambiguous.
This is why a RoPE model trained at 4k context can sometimes handle 16k or 32k inference without retraining — the rotations are still in-distribution. But going to 128k or 1M with the default base produces weird artifacts because the rotations are far outside what the model saw during training.
35.3 Naive context extension and why it fails
The simplest approach to “make this 4k-trained model handle 16k”: just feed it 16k tokens at inference and see what happens. With RoPE, this technically works — the model produces output instead of crashing. But the quality is bad.
The reason: at position 16k, the rotation angles are 4× what the model saw during training. The low-index pairs (high frequencies) wrap around many times even within the 4k training window, so every phase has been seen and they stay in-distribution. The high-index pairs (low frequencies) are the problem: they swept only a fraction of a full cycle during training, and at 16k they reach angles the model has never seen. The attention contributions from those pairs become unreliable.
The empirical result: a Llama 7B trained at 4k context, run naively at 16k context, scores significantly worse on long-context tasks than at 4k. The needle-in-a-haystack benchmark shows the degradation clearly: the model can find the needle if it’s near the start or near the end (where positions are more like training) but fails in the middle.
So naive extension doesn’t work. The fix is to modify the position encoding at inference time to keep the angles in the trained distribution, even at the longer length. This is the family of techniques called position interpolation.
35.4 Position interpolation (PI)
Position interpolation (Chen et al., 2023) is the simplest extension trick. The idea: at inference time, scale down the position indices so that the new longer context maps to the same range of angles the model was trained on.
Concretely, if the model was trained at context length L_train and you want to use it at length L_target = α × L_train, then at inference time you compute the rotation angle as if the position were pos / α instead of pos:
θ_i = (pos / α) × base^(-2i / d_h)
For α = 4 (going from 4k to 16k context), the rotations at the longest training position (4096) now happen at the new longest position (16384). The angles are the same; the model sees the same rotations.
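The scaling is a one-line change to the angle computation. A sketch, following the formula above (d_h and the lengths are illustrative):

```python
import numpy as np

def rope_angles(pos, d_h=128, base=10000.0, scale=1.0):
    """Rotation angles theta_i at one position. scale > 1 applies
    position interpolation: the angle is computed as if the position
    were pos / scale."""
    i = np.arange(d_h // 2)
    return (pos / scale) * base ** (-2.0 * i / d_h)

L_train, alpha = 4096, 4  # extend 4k -> 16k

# With PI, the angles at the new maximum position are exactly the
# angles the model saw at its training maximum position:
assert np.allclose(rope_angles(alpha * L_train, scale=alpha),
                   rope_angles(L_train))
```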
The catch: positions are now closer together in angle space. Two tokens one position apart at inference are separated by only a quarter of the angle that separated adjacent tokens during training. The model loses some of its ability to distinguish nearby tokens.
In practice, position interpolation works much better than naive extension but not perfectly. A model that scores 85% on a 4k task might score 80% on the same task at 16k with PI. The degradation is small but real, especially for tasks that require fine-grained positional reasoning.
PI was the first widely-adopted context extension technique. It’s simple, requires no retraining, and gets you 4× context extension with modest quality loss. Most early Llama 2 long-context fine-tunes used PI.
35.5 NTK-aware scaling
The next refinement: NTK-aware scaling (Reddit user u/bloc97, 2023, then formalized by various groups). The observation: position interpolation scales all the rotation frequencies uniformly, which compresses the high-frequency components and loses the model’s ability to distinguish nearby tokens.
NTK-aware scaling does something smarter: scale only the lower frequencies, leave the higher frequencies alone. The high-frequency pairs (which encode local position information) are kept at their original training distribution, while the low-frequency pairs (which encode long-range position information) are stretched to cover the longer context.
The key formula change: instead of dividing position by α, change the base:
new_base = base × α^(d_h / (d_h - 2))
With the new base, the lowest-frequency rotation has the right wavelength for the new context length, while the highest-frequency rotations are essentially unchanged.
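A quick check of the two endpoints (a sketch with illustrative d_h): with the modified base, the highest-frequency pair is untouched, and the lowest-frequency pair is slowed by exactly the factor α that PI would apply everywhere.

```python
import numpy as np

d_h, base, alpha = 128, 10000.0, 4.0
new_base = base * alpha ** (d_h / (d_h - 2))

def freq(i, b):
    """Rotation frequency of pair i: theta_i = pos * freq(i, base)."""
    return b ** (-2.0 * i / d_h)

# Highest-frequency pair (i = 0): exponent is 0, so the frequency is 1
# regardless of base -- local position information is untouched.
assert freq(0, new_base) == freq(0, base)

# Lowest-frequency pair (i = d_h/2 - 1): frequency divided by alpha,
# which is exactly what PI does to every pair.
i_low = d_h // 2 - 1
assert np.isclose(freq(i_low, new_base), freq(i_low, base) / alpha)
```

Intermediate pairs interpolate smoothly between the two endpoints, which is why the technique is sometimes described as a non-uniform interpolation.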
NTK-aware scaling preserves more of the model’s local-token discrimination while still extending the context. Empirically, it does ~1-2 points better than position interpolation on long-context benchmarks.
Both PI and NTK-aware scaling are inference-time-only techniques — no retraining needed. You just modify how the rotation angles are computed at inference. This makes them easy to deploy.
35.6 YaRN
YaRN (Yet another RoPE extension method) (Peng et al., 2023) is the current state-of-the-art context extension technique. It combines:
- NTK-aware scaling for the frequency adjustment.
- Length-aware temperature scaling for the attention logits to compensate for the changed distribution.
- A small amount of fine-tuning at the new context length to “bake in” the changes.
The key insight: even with NTK-aware scaling, the softmax temperature of attention is implicitly affected by the length. At longer contexts, attention has more positions to compete for, and the softmax tends to flatten. YaRN explicitly compensates by scaling the attention logits by a length-dependent factor.
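As a sketch of that compensation term (the 0.1·ln(α) + 1 fit is the value the YaRN paper reports for √(1/t), where α is the context scale factor; treat the constants as the paper's, not a free parameter):

```python
import numpy as np

def yarn_attention_scale(alpha):
    """Length-dependent multiplier on the attention logits (YaRN's
    temperature). 0.1*ln(alpha) + 1 is the sqrt(1/t) fit reported in
    the YaRN paper; in practice it is folded into the rotated q and k
    so the attention kernel itself is unchanged."""
    return 0.1 * np.log(alpha) + 1.0

# No extension (alpha = 1) means no change; larger extension factors
# sharpen the softmax slightly to counter its flattening over more
# competing positions.
assert yarn_attention_scale(1.0) == 1.0
assert yarn_attention_scale(32.0) > yarn_attention_scale(8.0) > 1.0
```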
The resulting model handles much longer contexts with quality close to the original training distribution. YaRN-fine-tuned models routinely extend from 4k to 128k context with minimal quality loss.
YaRN is now the standard fine-tuning recipe for long-context Llama variants. Any open-source “Llama 3 with 128k context” model has almost certainly been fine-tuned with YaRN.
The catch is that YaRN requires fine-tuning (a few hours of compute for a small model). Pure inference-time tricks (PI, NTK-aware) are easier to apply but achieve less.
35.7 Attention bandwidth and the long-context bottleneck
Even if you’ve extended the position encoding to 128k or 1M, you still have the attention compute and memory cost at long context. Recall from Chapter 6 that attention is O(s²) in compute and memory.
For a 128k-context request on Llama 3 70B:
- KV cache: 128k × 320 KB/token = 40 GB. Per request. (One H100 holds 80 GB total, so a single 128k request fills half a GPU’s memory.)
- Attention compute per layer: ~128k² × d_h × num_heads FLOPs. For 80 layers, prefill on a 128k prompt is ~10^16 FLOPs. On an H100 at 989 TFLOP/s with 30% MFU, that’s ~30 seconds of prefill.
Both are significant. The KV cache memory is the main constraint at very long context — it scales linearly, but the absolute size is large. Compute is the secondary constraint, especially for prefill.
For a 1M-context request: the KV cache is 320 GB per request (multi-GPU just for one request), and because attention compute scales quadratically, prefill costs roughly 64× the 128k case, i.e. hundreds of petaFLOPs. The cost is huge.
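The arithmetic above can be checked in a few lines, using the shapes the text assumes for a Llama 3 70B-class model (80 layers, GQA with 8 KV heads of head dim 128, bf16 cache):

```python
# Per-token KV cache: K and V, 8 KV heads, head dim 128, 2 bytes (bf16),
# across 80 layers.
kv_bytes_per_token = 2 * 8 * 128 * 2 * 80
assert kv_bytes_per_token == 320 * 1024     # 320 KB/token, as in the text

kv_128k_gib = 128 * 1024 * kv_bytes_per_token / 2**30
kv_1m_gib = 1024 * 1024 * kv_bytes_per_token / 2**30
assert kv_128k_gib == 40.0                  # half of one 80 GB H100
assert kv_1m_gib == 320.0                   # multi-GPU just for the cache

# Prefill: ~1e16 FLOPs on a 128k prompt, H100 bf16 peak 989 TFLOP/s, 30% MFU.
prefill_s = 1e16 / (989e12 * 0.30)
assert 30 < prefill_s < 40                  # roughly half a minute
```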
The techniques to reduce these costs:
- GQA / MLA (Chapter 33): smaller per-token KV cache.
- KV cache quantization (Chapter 26): smaller cache via lower precision.
- Paged attention (Chapter 24): more efficient memory utilization.
- Sparse attention (next section): avoid the O(s²) by attending to only a subset.
- Ring attention (next-next section): distribute attention across GPUs.
Each of these helps a different part of the bottleneck. None is sufficient alone for serving 1M-context requests at scale.
35.8 Ring attention
Ring attention (Liu et al., 2023) is a parallelism technique for attention that distributes the sequence dimension across GPUs. We touched on it in Chapter 28; here’s the deeper version.
The setup: a long sequence is split into K chunks, one chunk per GPU.
The naive solution would be to gather all the K and V on one GPU, but that defeats the purpose. Ring attention is cleverer: at each step, the K and V chunks are passed around the ring of GPUs. GPU i first attends its queries to its own K, V (local). Then it receives K, V from GPU i-1 and attends to those. Then it receives K, V from GPU i-2. And so on, until each GPU has attended to all chunks.
The communication is block-by-block ring transfer, which is bandwidth-efficient. The compute is the standard attention math, just done piecewise.
The result: a sequence of length s can be processed with attention compute O(s²) but memory O(s/K) per GPU, where K is the number of GPUs in the ring. For a 1M-context request on 8 GPUs, each GPU holds 128k tokens of KV cache (~40 GB) and the total compute is the same as the single-GPU case but distributed.
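A single-process simulation of the per-chunk accumulation (the ring communication is replaced by a plain loop over chunks; the merging uses the standard online log-sum-exp softmax trick, and the shapes are illustrative):

```python
import numpy as np

def softmax_attn(q, K, V):
    """Reference full attention for one query vector (single head,
    no causal mask)."""
    s = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

def ring_attn(q, K_chunks, V_chunks):
    """Visit K/V chunks one at a time, as if each lived on a different
    GPU in the ring, merging partial results with an online softmax."""
    d = q.shape[-1]
    m = -np.inf          # running max of the logits
    denom = 0.0          # running softmax denominator
    out = np.zeros(d)    # running weighted sum of values
    for K, V in zip(K_chunks, V_chunks):   # one ring round per chunk
        s = q @ K.T / np.sqrt(d)
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)     # rescale old accumulators
        w = np.exp(s - m_new)
        denom = denom * correction + w.sum()
        out = out * correction + w @ V
        m = m_new
    return out / denom

rng = np.random.default_rng(1)
d, s_len, n_gpus = 32, 64, 4
q = rng.normal(size=d)
K = rng.normal(size=(s_len, d))
V = rng.normal(size=(s_len, d))

ref = softmax_attn(q, K, V)
ring = ring_attn(q, np.split(K, n_gpus), np.split(V, n_gpus))
assert np.allclose(ref, ring)
```

The chunked result matches full attention exactly; the real implementation overlaps each chunk's compute with the transfer of the next chunk around the ring.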
Ring attention (or an equivalent sequence-parallel scheme) is the machinery behind “1M context window” models like Gemini 1.5 and Qwen 2.5-1M. Without it, you can’t fit a 1M-context request in any reasonable amount of GPU memory.
The catch: ring attention has high latency for prefill on long sequences. Each step of the ring is a synchronous communication round. For 8 GPUs and a 1M-context prefill, you need many ring rounds, and the per-round latency adds up.
For inference at long context, ring attention is the right tool. It’s complex to implement and requires careful integration with the serving stack, but it’s the only way to handle 1M+ contexts on commodity hardware.
35.9 Sparse attention variants
A different angle: don’t compute attention to every position. Instead, attend only to a sparse subset of the past, and skip the rest. This breaks the O(s²) quadratic.
The classical sparse attention patterns:
Local (windowed) attention. Each token only attends to the previous w tokens (a sliding window). The cost is O(s × w) instead of O(s²). Used in Longformer (Beltagy et al., 2020) and many “efficient transformer” variants.
Strided / dilated attention. Each token attends to every k-th position. Captures long-range patterns at lower cost.
Block-sparse attention. Divide the sequence into blocks; attend only between certain blocks based on a fixed pattern.
Routing-based sparse attention. A learned router decides which positions to attend to. More flexible but harder to train.
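The cost difference between full causal and windowed attention is easy to make concrete with boolean masks (a sketch; s and w are illustrative):

```python
import numpy as np

def window_mask(s, w):
    """Causal sliding-window mask: token t attends to the last w
    tokens, i.e. positions in (t - w, t]."""
    i = np.arange(s)
    return (i[:, None] >= i[None, :]) & (i[:, None] - i[None, :] < w)

s, w = 1024, 64
full = np.tril(np.ones((s, s), dtype=bool))
local = window_mask(s, w)

# Full causal attention computes ~s^2/2 query-key pairs; the window
# computes at most s*w -- linear in sequence length.
assert full.sum() == s * (s + 1) // 2
assert local.sum() == w * (w + 1) // 2 + (s - w) * w
assert local.sum() < s * w
```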
The various sparse attention papers achieve impressive theoretical complexity (BigBird and Performer are O(s); Reformer is O(s × log s)), but they share one common problem: they don’t preserve quality at competitive levels for general language modeling.
The reason is that natural language has long-range dependencies that don’t fit clean patterns. A model that only attends to the previous 256 tokens fails at tasks that require reaching back to the system prompt 8000 tokens earlier. A strided pattern misses the specific position needed.
The production reality: sparse attention is mostly relegated to specific tasks (very long documents where local patterns dominate, certain coding tasks) and doesn’t replace full attention for general LLMs. Modern long-context models (Llama 3.1 with 128k, Qwen 2.5 with 1M) use full attention with the various extension and parallelism techniques in this chapter, not sparse attention.
The exception is Mistral’s sliding window attention: each token attends only to the previous w = 4096 tokens, and information from further back reaches a token indirectly, through the layer stack (after N layers, the effective receptive field is roughly N × w). Combined with a rolling KV cache, this caps the per-token attention cost. It works well for chat-like workloads but gives up some direct long-range modeling.
35.10 The state of long context
A summary of where the field is in late 2025:
| Model | Advertised context | Effective (RULER) | Technique |
|---|---|---|---|
| GPT-4 Turbo | 128k | ~32k | unknown |
| Claude 3.5 Sonnet | 200k | ~64k | unknown |
| Gemini 1.5 Pro | 1M | ~128k | ring attention + ? |
| Llama 3.1 70B | 128k | ~32k | YaRN |
| Llama 3.1 405B | 128k | ~32k | YaRN |
| Qwen 2.5 72B | 128k | ~64k | YaRN |
| Qwen 2.5-1M | 1M | ~256k | YaRN + DCA |
| DeepSeek-V3 | 128k | ~64k | YaRN + MLA |
The pattern: advertised context is usually 4× the effective context for retrieval tasks, and worse for reasoning tasks.
The “1M context” models (Gemini, Qwen 2.5-1M) are the frontier. They use ring attention for the parallelism, MLA or GQA-8 for the KV cache, YaRN-style extension for the position encoding, and aggressive fine-tuning with long sequences. The compute and memory cost is enormous — Gemini 1.5 Pro at full 1M context costs much more per request than the same model at 32k context.
For production, the practical answer for long context is: use the model at no more than half its advertised context, use YaRN-tuned variants, accept that the cost scales linearly with sequence length, and consider whether your task actually needs the long context (often, RAG or chunking is better).
The skill is recognizing that “1M context” is a marketing number for most models, and that the real context is much smaller.
35.11 The mental model
Eight points to take into Chapter 36:
- Position encodings limit the trained context length. Naive extension produces garbage.
- RoPE rotates query and key vectors by position-dependent angles. The dot product depends on relative position.
- Position interpolation (PI) scales position indices down to keep angles in distribution. Simple, works.
- NTK-aware scaling adjusts the RoPE base instead, preserving high-frequency precision.
- YaRN combines NTK-aware scaling with attention temperature compensation and a small fine-tune. Current standard for long context.
- Attention is O(s²); long context requires KV cache compression, quantization, and parallelism.
- Ring attention distributes the sequence dimension across GPUs via a ring of K/V transfers.
- Sparse attention is mostly research; production long-context models use full attention with the techniques above.
In Chapter 36 we cover the production reality of disaggregated prefill/decode — where the prefill/decode asymmetry from Chapter 21 meets actual deployment.
Read it yourself
- Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding (2021). The RoPE paper.
- Chen et al., Extending Context Window of Large Language Models via Positional Interpolation (2023).
- Peng et al., YaRN: Efficient Context Window Extension of Large Language Models (2023).
- Liu et al., Ring Attention with Blockwise Transformers for Near-Infinite Context (2023).
- Hsieh et al., RULER: What’s the Real Context Size of Your Long-Context Language Models? (2024).
- Beltagy et al., Longformer: The Long-Document Transformer (2020), for sparse attention.
- Jiang et al., Mistral 7B (2023), for sliding window attention.
Practice
- Compute the rotation angle for RoPE at position 4096 with d_h = 128 and the default base 10000, for the lowest-frequency pair (i = 63). What about for the highest-frequency pair (i = 0)?
- Why does NTK-aware scaling preserve high-frequency precision better than position interpolation? Compare the new rotation angles for both at position 16384 with d_h = 128, original training length 4096.
- For a Llama 3 70B at 128k context, compute the per-request KV cache memory in bf16 and in INT4. Which fits on a single H100?
- Why is ring attention’s communication pattern bandwidth-efficient? Trace the data flow.
- Why does sparse attention struggle to match full attention quality on general language tasks? Argue at the level of “what are language dependencies.”
- Read the YaRN paper. Identify the three contributions and explain why each helps.
- Stretch: Fine-tune a 7B Llama on a small long-context dataset using YaRN-style position scaling. Measure RULER scores at various target contexts. Compare to a no-extension baseline.
Concept check
4 questions.
- 1. Why does position interpolation (PI) enable context extension without full retraining, and what is its key limitation?
- 2. YaRN improves over naive NTK-aware scaling primarily by doing what?
- 3. Ring attention distributes long-context attention across multiple GPUs by partitioning which dimension?
- 4. A model is advertised as supporting 1M token context. Which statement is most accurate about its practical utility at 900k tokens?