Part III · Inference Internals & Production Serving
Chapter 33 Deep dive ~22 min read

The attention compression family: MHA, MQA, GQA, MLA

"The KV cache is the new memory bottleneck. Attention head architecture is the new compression target"

Welcome to Stage 3 of Part III: the research frontier. The previous chapters (Stage 2) covered the practitioner internals you need to be effective in production. The next nine chapters cover the cutting-edge research that’s actively shaping how the next generation of models is built. By the end of Stage 3 you should be able to read any new architecture paper, place it in context, and predict whether it will matter for production.

We open Stage 3 with the attention compression family: a sequence of architectural choices about how to organize the K and V projections in attention. Each choice trades off model quality, KV cache size, and training stability. The progression from MHA → MQA → GQA → MLA is one of the cleanest examples of “research finds an elegant solution to a production problem” in the LLM era.

This chapter goes deep on why these variants exist, how they differ, and which is right for which use case.

Outline:

  1. The KV cache problem revisited.
  2. MHA — multi-head attention, the original.
  3. MQA — multi-query attention.
  4. GQA — grouped-query attention.
  5. MLA — multi-head latent attention (DeepSeek).
  6. How each affects KV cache size.
  7. How each affects model quality.
  8. The training-quality-vs-inference-cost frontier.
  9. The choices in current models.

33.1 The KV cache problem revisited

Recall from Chapter 22: the per-token KV cache size is

2 × n_layers × n_kv_heads × d_h × bytes

For Llama 3 70B with 80 layers, 8 KV heads, 128 head dim, in bf16, that’s 320 KB per token. For a context of 32k tokens, that’s 10.5 GB of KV cache per request. For 100 concurrent users, 1 TB.

The KV cache is the dominant memory cost of serving any non-trivial LLM. The model weights are static and shared across requests; the KV cache scales as users × context_length. Anything that reduces the per-token KV cache size translates directly to more concurrent users and longer contexts at the same hardware cost.
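The per-token formula and the Llama 3 70B numbers can be checked with a few lines of Python (a quick sketch; the function name is ours, not from any library):

```python
# per-token KV cache = 2 (K and V) x n_layers x n_kv_heads x d_h x bytes/elem
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, d_h: int,
                             bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * d_h * bytes_per_elem

# Llama 3 70B: 80 layers, 8 KV heads (GQA-8), head dim 128, bf16 (2 bytes)
per_token = kv_cache_bytes_per_token(80, 8, 128)
print(per_token // 1024, "KB per token")                  # 320 KB per token

per_request = per_token * 32_768                          # 32k-token context
print(round(per_request / 2**30, 1), "GiB per request")   # ~10 GiB

print(round(per_request * 100 / 2**40, 2), "TiB for 100 users")  # ~1 TiB
```

The same function reproduces every cache number in this chapter by swapping in the model's layer count, KV head count, and head dimension.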

The most direct way to reduce per-token cache size is to reduce n_kv_heads. The original transformer (and BERT, GPT-2, and most pre-2023 models) has n_kv_heads = n_attention_heads. So if a model has 64 attention heads, it has 64 KV heads, and the KV cache is at full size. Modern models reduce n_kv_heads aggressively, often to 8 or fewer, with techniques that preserve model quality despite the reduction.

This is the attention compression family. Each variant is a different way to use fewer KV heads than query heads.

[Figure: KV cache memory per token vs n_kv_heads (log scale) — MQA = 1, GQA-8 = 8, GQA-16 = 16, GQA-32 = 32, MHA = 64, MLA ≈ latent.]
KV cache per token scales linearly with n_kv_heads — MHA at 64 heads costs 64× more than MQA at 1, while GQA-8 (the modern default) gives 8× savings with negligible quality loss.

33.2 MHA — multi-head attention, the original

Multi-head attention (MHA) is the original attention from Attention Is All You Need (2017). Each attention head has its own Q, K, and V projections. For a model with H heads of dimension d_h:

  • W_Q^(i) of shape (d_model, d_h) for each head i
  • W_K^(i) of shape (d_model, d_h) for each head i
  • W_V^(i) of shape (d_model, d_h) for each head i

The total parameter count for the QKV projections per layer is 3 × d_model × (H × d_h), which equals 3 × d_model² in the common case where H × d_h = d_model.

The KV cache stores K and V for every head: n_kv_heads = H. For Llama 1 7B with 32 layers, H = 32, d_h = 128, in fp16, the per-token KV cache is 2 × 32 × 32 × 128 × 2 = 524,288 bytes = 512 KB. For a 32k context, that would be 16 GB per request.

MHA is the maximum-quality, maximum-cost option. Every head is fully independent and can learn its own attention pattern. The cost is the unaffordably large KV cache.
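To make the per-head bookkeeping concrete, here is a minimal single-token MHA decode step in NumPy. This is an illustrative sketch with our own names and unbatched shapes — real implementations fuse the heads into single matmuls — but the cache structure it shows is the real one:

```python
import numpy as np

def mha(x, Wq, Wk, Wv, past_k, past_v):
    """One MHA decode step.
    x: (d_model,) current token; Wq/Wk/Wv: (H, d_model, d_h) per-head projections;
    past_k/past_v: (H, T, d_h) cached keys/values for the T previous tokens."""
    H, _, d_h = Wq.shape
    q = np.einsum("d,hde->he", x, Wq)           # (H, d_h): one query per head
    k = np.einsum("d,hde->he", x, Wk)           # (H, d_h): one NEW key per head
    v = np.einsum("d,hde->he", x, Wv)
    K = np.concatenate([past_k, k[:, None]], axis=1)  # (H, T+1, d_h) -> cached
    V = np.concatenate([past_v, v[:, None]], axis=1)
    scores = np.einsum("he,hte->ht", q, K) / np.sqrt(d_h)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)               # softmax over past positions
    out = np.einsum("ht,hte->he", w, V)         # (H, d_h): concat + W_O follows
    return out, K, V                            # K, V go back into the cache
```

Every head carries its own K and V rows through the cache — which is exactly why the MHA cache is H times larger than MQA's.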

GPT-3, the original Llama 1, and most pre-2023 transformers used MHA. The reason: at the time, KV cache size wasn’t the dominant concern (people weren’t yet thinking about long-context production serving). Once long-context production serving became the primary use case, MHA’s cost became unsustainable.

[Figure: head layout comparison with H = 4 — MHA: Q₁–Q₄ each paired with its own K/V (4 independent K/V pairs); MQA: Q₁–Q₄ all sharing a single K/V pair.]
MHA gives every query head its own independent K and V; MQA forces all query heads to share a single K and V, collapsing the KV cache by a factor of H but limiting expressiveness.

33.3 MQA — multi-query attention

Multi-query attention (MQA) (Shazeer, 2019) is the extreme opposite of MHA: share a single K and V across all heads. There’s only one K projection and one V projection for the whole layer, but still H separate Q projections.

The K and V are computed once per token and broadcast to all heads. Each head has its own Q but shares the K and V with every other head.

The KV cache is now n_kv_heads = 1. For the same model dimensions, the per-token cache shrinks by a factor of H. For Llama-style models with 32 heads, MQA gives a 32× reduction in KV cache size.
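In code, the change from MHA is tiny: the K and V projections lose their head axis and NumPy broadcasting shares them across all query heads. A hedged sketch with our own shapes (unbatched, single decode step):

```python
import numpy as np

def mqa(x, Wq, Wk, Wv, past_k, past_v):
    """One MQA decode step.
    Wq: (H, d_model, d_h) per-head queries; Wk/Wv: (d_model, d_h) SHARED;
    past_k/past_v: (T, d_h) — a single cached K/V stream, 1/H the size of MHA's."""
    H, _, d_h = Wq.shape
    q = np.einsum("d,hde->he", x, Wq)             # (H, d_h): per-head queries
    K = np.concatenate([past_k, (x @ Wk)[None]])  # (T+1, d_h): one shared key row
    V = np.concatenate([past_v, (x @ Wv)[None]])
    scores = np.einsum("he,te->ht", q, K) / np.sqrt(d_h)  # K broadcast to all heads
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum("ht,te->he", w, V), K, V
```

Note that the cache arrays drop from shape (H, T, d_h) to (T, d_h): the H× reduction falls directly out of the shapes.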

That’s the win. The cost is quality loss. Sharing K and V means every head has to attend to the same projected representation of the past. Heads can no longer specialize their attention patterns based on different views of the input. The model is less expressive.

In practice, MQA at the same parameter count costs about 1-2 percentage points on standard benchmarks compared to MHA. Some models can be retrained from scratch with MQA and recover most of the gap; others can’t.

MQA was used in PaLM and a few other early models. It was a clever idea but the quality loss was real, and the field moved to a compromise: GQA.

33.4 GQA — grouped-query attention

Grouped-query attention (GQA) (Ainslie et al., 2023) is the middle ground between MHA and MQA. Instead of having all heads share one K/V (MQA) or each head having its own K/V (MHA), groups of heads share a K/V.

For a model with H = 32 query heads and n_kv_heads = 8, each KV head is shared by 4 query heads. The 32 query heads are partitioned into 8 groups of 4; each group shares one K projection and one V projection.

The KV cache size is now n_kv_heads = 8 — 4× smaller than MHA and 8× larger than MQA. The quality is much closer to MHA than to MQA, typically within 0.5 points on benchmarks: GQA-8 eliminates 75% of the cache in this configuration (vs MQA’s ~97%) at a small fraction of MQA’s quality loss. That’s the kind of trade-off engineers love.
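A common implementation trick (used, for example, in Llama-style code) is to repeat the cached KV heads up to H at attention time and then run ordinary multi-head attention — the cache still stores only n_kv heads. A hedged NumPy sketch of the score computation, with our own names:

```python
import numpy as np

def grouped_scores(q, K, n_kv):
    """q: (H, d_h) per-head queries; K: (n_kv, T, d_h) cached keys.
    Returns attention scores of shape (H, T)."""
    H, d_h = q.shape
    group = H // n_kv                      # query heads per KV head
    K_rep = np.repeat(K, group, axis=0)    # (H, T, d_h): broadcast, never stored
    return np.einsum("he,hte->ht", q, K_rep) / np.sqrt(d_h)

# n_kv = H recovers MHA, n_kv = 1 recovers MQA, 1 < n_kv < H is GQA.
```

The repeat puts consecutive query heads in the same group, matching the usual grouping convention; the memory win comes from K living at (n_kv, T, d_h) in the cache while only the transient repeated view reaches (H, T, d_h).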

GQA can be applied as an “uptrain” — take an MHA-trained model and continue training it with GQA, where each group’s K and V projections are initialized as the mean of the original per-head projections in that group. After a small amount of additional training, the model recovers its original quality and now has a much smaller KV cache. The GQA paper demonstrated this conversion on T5 checkpoints.
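The mean-pooling initialization is a one-liner. A sketch with our own names (the GQA paper describes exactly this pooling for the uptrain):

```python
import numpy as np

def mean_pool_init(W_mha, n_kv):
    """Initialize GQA K (or V) projections from an MHA checkpoint.
    W_mha: (H, d_model, d_h) per-head MHA projections.
    Returns (n_kv, d_model, d_h): each group's projection is the mean of the
    H // n_kv original head projections in that group."""
    H = W_mha.shape[0]
    groups = W_mha.reshape(n_kv, H // n_kv, *W_mha.shape[1:])
    return groups.mean(axis=1)
```

The averaged projection is a reasonable starting point because it preserves the overall scale and direction of the group's original heads; continued training then re-specializes the shared head.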

GQA has become the default attention architecture for modern LLMs:

  • Llama 2 / 3 / 3.1 / 3.2 — GQA with 8 KV heads.
  • Mistral / Mixtral — GQA with 8 KV heads.
  • Qwen 2 / 2.5 — GQA with 4-8 KV heads depending on model size.
  • Gemma 2 / 3 — GQA.
  • Phi-3 / 4 — GQA.

The choice of n_kv_heads = 8 is not magical; it was the empirically discovered sweet spot in the GQA paper. Some models go lower (4 KV heads) for tighter memory budgets at small additional quality cost.

[Figure: GQA — grouped-query attention with 32 Q heads and 8 KV heads, 4 query heads per group; Group 1 (Q₁–Q₄) shares K₁/V₁, Group 2 (Q₅–Q₈) shares K₂/V₂, …, Group 8 (Q₂₉–Q₃₂) shares K₈/V₈.]
GQA partitions H query heads into G groups, each sharing one K/V pair. The ratio H/G is the cache compression factor — 4× in the 32-head example shown, 8× for 64-head models like Llama 3 70B — and GQA-8 stays under 0.5 points of quality cost. The production sweet spot.

The KV cache savings are real and substantial. For Llama 3 70B (80 layers, 64 attention heads, GQA-8, 128 head dim), the per-token cache is 2 × 80 × 8 × 128 × 2 = 320 KB. With MHA (n_kv_heads = 64), it would be 2 × 80 × 64 × 128 × 2 = 2.6 MB. 8× smaller cache with negligible quality cost.

This is one of the most consequential architectural changes in the modern LLM era. It’s a single hyperparameter — n_kv_heads — and choosing it well is what makes long-context serving affordable.

33.5 MLA — multi-head latent attention

Multi-head latent attention (MLA) is DeepSeek’s contribution (DeepSeek-V2, 2024). It’s the most ambitious compression scheme to date and goes further than GQA.

The trick: instead of reducing the number of KV heads, compress the K and V into a low-rank latent vector and decompress them on the fly during attention.

The architecture:

[Figure: MLA data path — token (d_model = 5120) → compress via W_DKV → latent (d_kv = 512, the only thing cached) → expand via W_UK and W_UV → full per-head K₁…K_H and V₁…V_H at attention time (not cached).]
MLA caches only the low-rank latent vector (512 dims) per token instead of full per-head K and V (H × d_h dims), giving 32-64× cache reduction with no measurable quality loss.
  1. The model has a shared latent dimension d_kv that’s much smaller than d_model. For DeepSeek-V2 with d_model = 5120, d_kv = 512.
  2. Each token’s K and V information is compressed into a single vector of dimension d_kv. This is the latent: one vector per token, not one K vector and one V vector per head.
  3. At attention time, the latent is expanded through learned matrices to produce the per-head K and V. The expansion is per-head (different heads decompress to different K/V) but the underlying latent is shared.
  4. The KV cache stores the latent vectors, not the per-head K and V, so the per-token, per-layer cache is roughly d_kv × bytes — dramatically smaller than even GQA.

One wrinkle: in DeepSeek’s actual design, a single joint latent of dimension d_kv serves both K and V, and the cache also holds a small decoupled key (64 dims in DeepSeek-V2) that carries the RoPE position information, since rotary embeddings can’t be applied to the shared compressed latent. The essential point stands: the cache stores compressed latents, not full K and V.
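A toy version of the compress/expand path, with our own shapes and a smaller head count for readability (this deliberately omits DeepSeek's decoupled RoPE key and all the attention math — it only shows what gets cached):

```python
import numpy as np

d_model, d_kv, H, d_h = 5120, 512, 16, 128
rng = np.random.default_rng(0)
W_DKV = rng.normal(size=(d_model, d_kv)) * 0.02  # down-projection (compress)
W_UK = rng.normal(size=(H, d_kv, d_h)) * 0.02    # per-head up-projection for K
W_UV = rng.normal(size=(H, d_kv, d_h)) * 0.02    # per-head up-projection for V

x = rng.normal(size=d_model)         # one token's hidden state
c = x @ W_DKV                        # (d_kv,): this is ALL that gets cached
k = np.einsum("c,hcd->hd", c, W_UK)  # (H, d_h): full per-head keys, rebuilt
v = np.einsum("c,hcd->hd", c, W_UV)  # (H, d_h): full per-head values, rebuilt

# cached floats per token per layer: d_kv, instead of 2 * H * d_h for MHA
print(d_kv, "vs", 2 * H * d_h)       # 512 vs 4096
```

The expansion matrices are per-head, so different heads still see different K/V views of the token — the sharing happens in the compressed space, not in the heads.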

For DeepSeek-V2 with d_kv = 512, the per-token latent KV cache is roughly ~1 KB per layer, vs ~4 KB per layer for a GQA-8 layer with 128-dim heads in bf16. Across 60 layers, the total per-token KV cache is ~70 KB, vs ~200 KB+ for GQA-8 in a similar-sized model.

The quality cost is essentially zero. DeepSeek-V2 and V3 demonstrate that MLA matches or exceeds GQA quality at the same parameter count. This is one of the most surprising results of the 2024 architecture wave.

The catches:

  • More complex inference. The K and V have to be decompressed from the latent at each attention step, which adds a small amount of extra compute. The extra compute is small relative to the overall attention cost, so the trade-off is favorable.
  • More complex training. MLA requires careful initialization and training tuning. Most labs haven’t replicated it yet.
  • Special hardware-software integration. Efficient MLA inference requires the serving stack to know about the latent decompression. vLLM and SGLang both have MLA-specific kernels for DeepSeek models.

MLA is the future direction for KV cache compression, but adoption is slow because:

  • It’s tied to DeepSeek’s specific implementation.
  • Most labs are still optimizing GQA before reaching for MLA.
  • The training quality story is less proven outside DeepSeek’s models.

I expect more models to adopt MLA over the next year, especially as DeepSeek-style training recipes become more standard.

33.6 How each affects KV cache size

Putting it all together with concrete numbers for a hypothetical 70B model with H = 64, d_h = 128, n_layers = 80:

Variant           n_kv_heads   Per-token KV cache   vs MHA
MHA               64           2.6 MB               1× (baseline)
GQA-32            32           1.3 MB               2× smaller
GQA-16            16           660 KB               4× smaller
GQA-8             8            330 KB               8× smaller
GQA-4             4            165 KB               16× smaller
GQA-2             2            82 KB                32× smaller
MQA               1            41 KB                64× smaller
MLA (d_kv=512)    n/a          ~80 KB               ~32× smaller (approx)
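The GQA rows of the table follow from the same per-token formula used throughout the chapter; a quick sketch to reproduce them (function name is ours):

```python
# Per-token KV cache in KB for the hypothetical 70B model:
# H = 64, d_h = 128, n_layers = 80, bf16 (2 bytes per element).
def per_token_kb(n_kv_heads, n_layers=80, d_h=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * d_h * bytes_per_elem / 1024

baseline = per_token_kb(64)  # MHA
for name, n_kv in [("MHA", 64), ("GQA-32", 32), ("GQA-16", 16),
                   ("GQA-8", 8), ("GQA-4", 4), ("GQA-2", 2), ("MQA", 1)]:
    kb = per_token_kb(n_kv)
    print(f"{name:7s} {kb:7.0f} KB   {baseline / kb:.0f}x vs MHA")
```

The MLA row doesn't fit this formula — its cache is per-layer latents rather than per-head K/V — which is why the table lists it with an approximate size instead of an n_kv_heads value.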

The 8× savings of GQA-8 over MHA is the modern default. MLA pushes this further to 32-64×. MQA pushes it even further but at quality cost.

For a serving fleet, the choice of attention variant directly determines:

  • How many concurrent users fit in a given GPU memory budget.
  • How long contexts can be without running out of memory.
  • The cost-per-token for any non-trivial workload.

A 70B model with GQA-8 vs MHA has roughly 8× the serving capacity at the same hardware cost — and that’s just from the architectural choice, before any other optimization.

33.7 How each affects quality

The quality story, summarized:

  • MHA: maximum quality, maximum cost. The baseline.
  • GQA-32: essentially no quality loss vs MHA. But the cache savings are small (2×).
  • GQA-16: very small quality loss (<0.1 points). Cache 4× smaller.
  • GQA-8: small quality loss (<0.5 points typically). Cache 8× smaller. Modern default.
  • GQA-4: noticeable but small loss (~0.5-1 points). Cache 16× smaller.
  • GQA-2: larger loss (~1-2 points). Cache 32× smaller.
  • MQA: largest loss (~1-3 points). Cache H× smaller.
  • MLA: essentially no quality loss vs MHA. Cache 32-64× smaller. The frontier.

The “uptrain” trick — taking an MHA-trained model and converting it to GQA via continued training — works for going from MHA to GQA-8 with a small fraction of the original pretraining compute (the GQA paper used roughly 5%). The converted model recovers most of its quality; the paper demonstrated this on T5 checkpoints, and Llama 2’s 70B shipped with GQA-8 where Llama 1’s 65B had used MHA.

Going further (uptraining to GQA-4 or MQA) is harder. Quality recovery is less complete; the model may end up worse than if it had been trained with the smaller n_kv_heads from scratch.

For new training runs, GQA-8 is the default. For models specifically optimized for long-context inference, MLA (or even more aggressive compression) is increasingly common.

33.8 The training-quality-vs-inference-cost frontier

The deeper observation: the choice of n_kv_heads is a Pareto front between training quality and inference cost. There’s no free lunch.

[Figure: Pareto front of model quality (higher = better, ↑) vs per-token KV cache size (log scale, smaller = better, →) — MHA (H=64), GQA-32, GQA-8 (the default), GQA-4, MQA, with MLA extending the frontier.]
The Pareto frontier of quality vs KV cache size — GQA-8 is the current production sweet spot, and MLA extends the frontier by achieving MHA-level quality at near-MQA cache cost.
  • Lower n_kv_heads → cheaper inference, harder training, slightly lower quality at convergence.
  • Higher n_kv_heads → more expensive inference, easier training, slightly higher quality at convergence.

Different models pick different points on this frontier based on their priorities:

  • Frontier labs serving billions of requests want the cheapest inference, so they lean toward GQA-4 or MLA. They can afford to pay for the harder training to recover quality.
  • Open-source labs releasing once and moving on care about training simplicity, so they default to GQA-8 (the empirical sweet spot).
  • Specialized models (e.g., reasoning-focused) might prefer slightly higher quality and accept more expensive serving — reasoning models sometimes use GQA-16 or even MHA in experimental variants.

The frontier is moving. As MLA becomes more battle-tested, the “optimal” point shifts toward more aggressive compression. As context windows get longer (because production demands it), KV cache size becomes more dominant relative to other costs, which pushes toward more compression.

33.9 The choices in current models

A summary of what current open models use:

Model             Year   Variant   n_kv_heads (or equivalent)
GPT-3             2020   MHA       96
Llama 1           2023   MHA       32
Llama 2 70B       2023   GQA       8
Llama 3 70B       2024   GQA       8
Llama 3.1 405B    2024   GQA       8
Mistral 7B        2023   GQA       8
Mixtral 8x7B      2023   GQA       8
Qwen 2.5 72B      2024   GQA       8
Gemma 2 27B       2024   GQA       16
Phi-3 medium      2024   GQA       10
DeepSeek-V2       2024   MLA       n/a (latent dim 512)
DeepSeek-V3       2024   MLA       n/a (latent dim 512)

The pattern: everyone is on GQA-8 as the safe default, with DeepSeek pushing into MLA. New models in 2025 are increasingly experimenting with MLA or smaller GQA variants.

The Llama 4 architecture (when released) is widely expected to push further on KV cache compression, possibly adopting MLA or another novel scheme. This is one of the most actively iterated parts of the architecture.

33.10 The mental model

Eight points to take into Chapter 34:

  1. The KV cache is the dominant memory cost of serving. Reducing it directly increases serving capacity.
  2. MHA is the original. Each head has its own K and V. Maximum quality, maximum cost.
  3. MQA shares one K and V across all heads. Maximum compression, real quality cost.
  4. GQA groups heads to share K and V in groups. The compromise. GQA-8 is the modern default.
  5. MLA compresses K and V into a low-rank latent vector. The frontier. Lossless quality, big compression.
  6. GQA-8 vs MHA gives 8× cache savings at <0.5 points quality cost. The biggest single architectural lever.
  7. MLA can give 32-64× cache savings at essentially no quality cost. Used by DeepSeek-V2/V3.
  8. The choice is a Pareto front between training quality and inference cost. No free lunch, but the frontier is moving toward more compression.

In Chapter 34 we cover the other major architectural compression: Mixture of Experts.


Read it yourself

  • Vaswani et al., Attention Is All You Need (2017). The MHA original.
  • Shazeer, Fast Transformer Decoding: One Write-Head is All You Need (2019). The MQA paper.
  • Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (2023).
  • The DeepSeek-V2 technical report. The MLA section is the introduction of the technique.
  • The DeepSeek-V3 technical report. Continued use and refinement of MLA.
  • The Llama 2 paper (Touvron et al., 2023), section on attention. Justifies the GQA choice.

Practice

  1. Compute the KV cache size per token for a 7B model with H=32, d_h=128, n_layers=32 under MHA, GQA-8, GQA-4, and MQA.
  2. Why does GQA-8 give 8× cache savings over MHA without 8× quality loss? Argue at the level of “what are the heads doing.”
  3. Read the DeepSeek-V2 paper’s MLA section. Walk through the latent compression and decompression steps.
  4. Why was MQA quickly displaced by GQA? Identify the specific quality issue with MQA.
  5. For a 70B model serving 100 concurrent users at 4k context each, compute the total KV cache memory under MHA, GQA-8, and MLA. What’s the difference in serving capacity?
  6. Why hasn’t every lab adopted MLA? Argue both technical and adoption-curve reasons.
  7. Stretch: Read the GQA paper’s uptrain experiment (section 4). Why does uptraining from MHA to GQA work? Reproduce the quality recovery curve mentally.

Concept check

  1. Llama 3 70B uses GQA with 8 KV heads versus MHA with 64 query heads. By what factor does this reduce per-token KV cache size compared to MHA?
  2. Multi-Query Attention (MQA) uses a single KV head shared by all query heads. What is the main quality downside compared to GQA?
  3. DeepSeek's MLA compresses KV heads by projecting K and V into a low-rank latent space. What does this allow that GQA does not?
  4. A model was trained with MHA and you want to continue pretraining it with GQA to reduce serving costs. What is the standard approach for initializing the GQA KV heads from the MHA checkpoint?