Abstract
A 32-layer decoder with 32 query heads and 128-dimensional heads writes a large cache when every query head owns its own key and value head. In FP16, one 32,768-token active sequence needs about 16 GiB of KV cache before allocator overhead, pages, batching, beam search, or metadata.
Change only one architectural number, from 32 KV heads to 8 KV heads, and the same 32,768-token sequence drops to about 4 GiB. No FP8. No INT4. No paging trick. The cache shrinks because the model stopped writing as many key/value vectors in the first place.
Grouped-query attention (GQA) sits in that space. It keeps many query heads, but it shares fewer key/value heads across them. For serving, that makes GQA the upstream lever behind a lot of KV-cache math: quantization changes bytes per value, paging changes allocation efficiency, and GQA changes how many values exist.
For the base attention mechanism, see the Q/K/V attention walkthrough. For the downstream serving story, the two KV-cache pieces cover compression, paging, and drift, along with FP8/INT4 scale budgets.
1. The Cache Stores K and V, Not Old Q
In decoder inference, every new token produces fresh query, key, and value projections for each layer. The query for the current token is temporary. It asks, "which previous positions should I read?"
The key and value vectors for past tokens are different. They must be kept because every later generated token may attend back to them. That stored history is the KV cache.
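A minimal sketch of one decode step makes the asymmetry concrete. This is illustrative pure Python with tiny vectors, not a real kernel; the weight matrices are passed as lists of output columns, and batching, masking, and multiple heads are all omitted:

```python
import math

def decode_step(x_new, w_q, w_k, w_v, cache_k, cache_v):
    # Project the new token. q is used once and discarded after this step;
    # k and v are appended to the cache because future tokens attend to them.
    q = [sum(a * b for a, b in zip(x_new, col)) for col in w_q]
    k = [sum(a * b for a, b in zip(x_new, col)) for col in w_k]
    v = [sum(a * b for a, b in zip(x_new, col)) for col in w_v]
    cache_k.append(k)
    cache_v.append(v)

    # Scaled dot-product attention over every cached position.
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(len(q))
              for key in cache_k]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Output is a weighted sum of cached values; q never enters the cache.
    return [sum(w * val[i] for w, val in zip(weights, cache_v))
            for i in range(len(v))]
```

Only `cache_k` and `cache_v` grow with sequence length; `q` lives for exactly one step.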
Multi-head attention (MHA), multi-query attention (MQA), and grouped-query attention (GQA) differ in how many key/value heads are written:
| Attention type | Query heads | KV heads | What gets shared |
|---|---|---|---|
| MHA | n_q | n_q | Nothing. Each query head has its own K/V head. |
| GQA | n_q | 1 < n_kv < n_q | Groups of query heads share one K/V head. |
| MQA | n_q | 1 | All query heads share one K/V head. |
The naming can hide the serving impact. Query heads still exist in GQA and MQA. The model can still compute many attention score patterns for the current token. The difference is in the stored memory: old keys and values have fewer K/V head streams per token.
For an implementation, n_q and n_kv usually need a clean grouping relationship. With 32 query heads and 8 KV heads, each K/V head serves 4 query heads. With 32 query heads and 1 KV head, every query head reads from the same K/V stream.
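That grouping relationship can be written as a pure index mapping. A sketch, with an illustrative function name:

```python
def kv_head_for_query_head(q_head: int, n_q: int, n_kv: int) -> int:
    # Contiguous blocks of n_q // n_kv query heads read the same K/V head.
    assert n_q % n_kv == 0, "query heads must divide evenly into KV groups"
    return q_head // (n_q // n_kv)

# 32 query heads over 8 KV heads: query heads 0-3 map to KV head 0, 4-7 to 1, ...
print([kv_head_for_query_head(q, 32, 8) for q in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```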
2. The Formula Changes Before Precision Enters
The raw KV cache footprint per token is:

bytes_per_token = 2 × n_layers × n_kv × d_head × bytes_per_value

where:

- The factor of 2 accounts for one key and one value.
- n_layers is the number of transformer layers.
- n_kv is the number of key/value heads.
- d_head is the dimension of each head.
- bytes_per_value is the number of bytes per stored value.
The most common counting mistake is to put query heads into this cache formula. Query heads matter for attention computation and model capacity. They do not create stored K/V vectors for previous tokens unless the architecture gives them separate K/V heads.
Hold everything except n_kv constant:
```python
def kv_cache_gib(
    layers: int,
    kv_heads: int,
    head_dim: int,
    tokens: int,
    bytes_per_value: float,
) -> float:
    values_per_token = 2 * layers * kv_heads * head_dim
    total_bytes = values_per_token * tokens * bytes_per_value
    return total_bytes / (1024 ** 3)

layers = 32
query_heads = 32
head_dim = 128
tokens = 32_768
bytes_per_value = 2.0  # FP16 or BF16 cache values

for kv_heads in [32, 8, 4, 1]:
    print(kv_heads, round(kv_cache_gib(layers, kv_heads, head_dim, tokens, bytes_per_value), 2))
```

Expected output:

```
32 16.0
8 4.0
4 2.0
1 0.5
```

At 32k tokens, the FP16 cache moves like this:
| Shape | Query heads | KV heads | FP16 KiB/token | FP16 cache at 32k |
|---|---|---|---|---|
| MHA | 32 | 32 | 512 | 16.0 GiB |
| GQA-8 | 32 | 8 | 128 | 4.0 GiB |
| GQA-4 | 32 | 4 | 64 | 2.0 GiB |
| MQA | 32 | 1 | 16 | 0.5 GiB |
The reduction from MHA is:

reduction = n_q / n_kv
With 32 query heads and 8 KV heads, GQA stores one quarter as much KV cache as MHA. With one KV head, MQA stores one thirty-second as much.
That ratio applies only to the cache term. It does not mean the whole model runs 4x or 32x faster. Feed-forward layers, projections, logits, sampling, communication, queueing, and kernel launch overheads still exist. The gain is largest when decode is dominated by reading cached K/V.
3. Decode Bandwidth Is Where GQA Shows Up
A long prompt can fit in memory and still serve poorly. During generation, every new token reads the previous K/V history layer by layer. A token generated after a 32,768-token prompt reads roughly 32,768 cached prompt positions; after that token is appended, the next decode step reads one more.
That repeated scan makes decode sensitive to memory bandwidth. Noam Shazeer's MQA paper framed the issue around incremental decoding: training can parallelize across sequence length, but generation repeatedly loads large key/value tensors. MQA reduces those tensors by sharing K/V across heads. GQA keeps the same idea but avoids the all-or-nothing jump to one KV head.
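A back-of-envelope model shows why decode is bandwidth-sensitive. This is a sketch: the 2,000 GB/s figure is an assumed HBM bandwidth, not a measurement, and real kernels overlap cache reads with compute and other memory traffic:

```python
def decode_read_ms(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: float,
                   bandwidth_gb_s: float) -> float:
    # Cached K/V bytes read to generate one token: the full history,
    # keys and values both, at every layer.
    bytes_read = 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value
    return bytes_read / (bandwidth_gb_s * 1e9) * 1e3  # milliseconds

# FP16 cache at 32k context on an assumed 2,000 GB/s memory system.
for kv_heads in [32, 8, 1]:
    print(kv_heads, round(decode_read_ms(32, kv_heads, 128, 32_768, 2.0, 2000), 3))
```

Under these assumptions, the cache read alone costs about 8.6 ms per token for MHA but about 2.1 ms for GQA-8, a floor on per-token latency that no amount of extra compute removes.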
The important separation:
| Phase | What dominates | How GQA helps |
|---|---|---|
| Prefill | Process the prompt and write the initial cache. | Fewer K/V heads reduce cache writes and stored bytes, but attention over prompt tokens still has its own compute cost. |
| Decode | Generate one token at a time while reading past cache. | Fewer K/V heads reduce the bytes read from cache at each decode step. |
| Scheduling | Keep many active sequences resident. | Smaller per-sequence cache allows larger live batches before memory pressure forces eviction or admission limits. |
GQA also interacts with paged cache systems. PagedAttention and similar cache managers reduce fragmentation and make cache allocation more flexible, but pages still contain K/V data. A smaller n_kv means each token occupies fewer bytes inside those pages.
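The per-page effect is easy to quantify. A sketch, assuming a 16-token block size (a common paged-cache default) and summing across all layers for one block of positions:

```python
def kv_block_bytes(kv_heads: int, head_dim: int, block_tokens: int,
                   bytes_per_value: float, layers: int) -> float:
    # K and V for block_tokens positions, across every layer.
    return 2 * layers * kv_heads * head_dim * block_tokens * bytes_per_value

# 16-token pages, FP16 values, 32 layers.
for kv_heads in [32, 8, 1]:
    kib = kv_block_bytes(kv_heads, 128, 16, 2.0, 32) / 1024
    print(kv_heads, kib, "KiB per 16-token page")
```

Same page geometry, four to thirty-two times fewer bytes resident per page of tokens.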
The serving stack still matters. TensorRT-LLM, for example, documents MHA/MQA/GQA generation kernels, paged KV cache support, and INT8/FP8 KV cache modes. Those features answer different questions:
- MQA/GQA changes how many K/V heads are produced.
- Paged KV cache changes how those bytes are allocated and reused.
- FP8/INT8 cache modes change how many bytes each stored value uses.
Mixing the layers together can hide the first design decision. If a model already uses GQA, the KV bill was reduced before any cache quantizer touched the values.
4. Head Sharing Is Not Free Memory
K/V sharing removes independent projection paths. In MHA, each query head can learn its own key and value projection. In MQA, every query head reads from the same key/value projection. GQA chooses a middle point: several query heads share a K/V projection, but not all of them.
That trade-off is why GQA became useful. Shazeer showed that MQA can greatly reduce memory bandwidth for incremental decoding, with some quality cost relative to full MHA. Ainslie et al. later introduced GQA as an intermediate number of K/V heads and reported quality close to MHA with speed closer to MQA after uptraining from multi-head checkpoints.
The quality question depends on workload and model training, not only on the formula. A retrieval-heavy prompt may rely on fine distinctions among many similar passages. A code task may need exact variable references from thousands of tokens back. A casual chat workload may tolerate more sharing. The cache math is deterministic; the model behavior is empirical.
Useful checks before treating GQA as a pure serving win:
- Long-context retrieval accuracy: does the model still select the right source span when many candidates look similar?
- Needle or passkey tasks: does recall degrade at specific positions or context lengths?
- Code completion: do identifiers and imports survive across long files?
- Multi-turn instruction following: do older constraints remain active after many generated turns?
- Safety and policy recall: do system-level instructions remain stable when the prompt is large?
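The passkey check above can be generated offline. A minimal sketch; the filler sentence, passkey format, and function names are illustrative, and scoring assumes the model's answer is available as a string:

```python
import random

def make_passkey_prompt(approx_tokens: int, depth: float, seed: int = 0):
    # Bury a random passkey at a relative depth inside filler text,
    # then ask the model to repeat it.
    rng = random.Random(seed)
    passkey = f"{rng.randrange(10_000, 99_999)}"
    filler = "The sky is blue. The grass is green. " * (approx_tokens // 8)
    cut = int(len(filler) * depth)
    needle = f" The passkey is {passkey}. Remember it. "
    prompt = filler[:cut] + needle + filler[cut:] + "\nWhat is the passkey?"
    return prompt, passkey

def passkey_recalled(model_answer: str, passkey: str) -> bool:
    return passkey in model_answer
```

Sweeping `depth` from 0.0 to 1.0 at each context length of interest exposes position-dependent degradation, the pattern the long-context literature warns about.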
Those checks are grounded in the broader long-context evaluation problem. Liu et al. used multi-document QA and key-value retrieval tasks to show that changing where relevant information appears in a long context can change model performance, especially when the evidence sits in the middle of the input.
The right comparison is not only MHA versus GQA at the same precision. Serving teams often compare GQA FP16 against MHA FP8, GQA FP8 against MQA FP16, or GQA INT4 against GQA FP8. The architectural choice and numeric choice multiply, but the quality effects may stack in uneven ways.
5. Quantization Multiplies the GQA Savings
Precision changes bytes_per_value, the bytes per stored value:
- FP16/BF16 cache: roughly 2 bytes per value.
- FP8 cache: roughly 1 byte per value, plus scale policy details.
- INT4 cache: roughly 0.5 bytes per value, plus scale and packing overhead.
Architecture changes n_kv. Because both variables sit in the same product, the savings multiply:

cache ∝ n_kv × bytes_per_value
For the same 32-layer, 32,768-token example, using INT4 with one FP16 scale per 64 values gives an effective 0.53125 bytes/value before extra metadata:
| Shape | FP16/BF16 | FP8 | INT4 group-64 |
|---|---|---|---|
| MHA, n_kv = 32 | 16.0 GiB | 8.0 GiB | 4.25 GiB |
| GQA-8, n_kv = 8 | 4.0 GiB | 2.0 GiB | 1.06 GiB |
| GQA-4, n_kv = 4 | 2.0 GiB | 1.0 GiB | 0.53 GiB |
| MQA, n_kv = 1 | 0.5 GiB | 0.25 GiB | 0.13 GiB |
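The INT4 column assumes one FP16 scale per 64 values; the effective bytes per value, and the table entries, can be derived directly. A sketch matching those assumptions:

```python
def int4_effective_bytes(group_size: int, scale_bytes: float = 2.0) -> float:
    # 4-bit payload plus one FP16 scale amortized over each group.
    return 0.5 + scale_bytes / group_size

print(int4_effective_bytes(64))  # 0.53125

# Reproduce the INT4 group-64 column from the FP16 column.
for name, fp16_gib in [("MHA", 16.0), ("GQA-8", 4.0), ("GQA-4", 2.0), ("MQA", 0.5)]:
    print(name, round(fp16_gib * int4_effective_bytes(64) / 2.0, 2))
# MHA 4.25, GQA-8 1.06, GQA-4 0.53, MQA 0.13
```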
The table explains a common capacity-planning surprise. A GQA model in FP16 can have a cache footprint comparable to an MHA model in INT4: GQA-8 FP16 is 4.0 GiB here, while MHA INT4 group-64 is about 4.25 GiB. If model selection is still open, architecture may beat a more aggressive cache format.
That does not make quantization unimportant. It means the order of decisions matters:
1. Choose a model architecture with a cache shape that fits the workload.
2. Use cache paging to avoid allocator waste and support dynamic sequence lengths.
3. Add FP8 or INT4 when long context or concurrency still makes cache memory the limiter.
4. Evaluate the architecture and precision together, because head sharing and quantization can both weaken attention behavior.
The two earlier KV articles sit after step 1. They explain what happens once the model has already decided how many K/V values it writes.
6. A Serving Planner Should Ask for n_kv Early
The fastest way to mis-size a deployment is to ask only for parameter count and context length. Two 8B-class models can have similar weight memory and very different cache bills if one uses MHA and the other uses GQA.
Ask for these fields before estimating capacity:
```
layers
query_heads
kv_heads
head_dim
max_context_tokens
target_concurrent_sequences
cache_dtype
cache_scale_overhead
paged_cache_block_size
```

A small planning function can separate the architecture term from the precision term:
```python
def kv_cache_plan(
    layers: int,
    query_heads: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    concurrent_sequences: int,
    bytes_per_value: float,
) -> dict[str, float]:
    # Reuses kv_cache_gib from the earlier snippet.
    per_sequence_gib = kv_cache_gib(
        layers=layers,
        kv_heads=kv_heads,
        head_dim=head_dim,
        tokens=context_tokens,
        bytes_per_value=bytes_per_value,
    )
    return {
        "query_heads": query_heads,
        "kv_heads": kv_heads,
        "kv_group_size": query_heads / kv_heads,
        "per_sequence_gib": per_sequence_gib,
        "batch_cache_gib": per_sequence_gib * concurrent_sequences,
    }

print(kv_cache_plan(
    layers=32,
    query_heads=32,
    kv_heads=8,
    head_dim=128,
    context_tokens=32_768,
    concurrent_sequences=16,
    bytes_per_value=1.0,  # FP8 cache
))
```

Expected output:
```
{'query_heads': 32, 'kv_heads': 8, 'kv_group_size': 4.0, 'per_sequence_gib': 2.0, 'batch_cache_gib': 32.0}
```

That 32 GiB number is only the raw cache for 16 active sequences. It does not include model weights, temporary activations, allocator reserve, page metadata, CUDA graphs, communication buffers, LoRA adapters, or fragmentation. The estimate is still valuable because it makes the first-order cache bill visible.
For production decisions, use the raw formula as the floor, then validate with the actual serving engine:
- Measure prefill and decode separately.
- Track cache residency under realistic request length distributions.
- Compare accepted concurrency at the same latency target.
- Run task-specific quality checks for MHA, GQA, and MQA variants when model choice is open.
- Re-run long-context checks after enabling FP8 or INT4 KV cache.
The serving question becomes much cleaner when n_kv is treated as a first-class capacity variable. Quantization answers "how many bytes per stored value?" GQA answers "how many values did the model decide to store?"
References
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. NeurIPS 2017.
- Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. 2019.
- Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023.
- Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the Middle: How Language Models Use Long Contexts. TACL 2024.
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
- NVIDIA TensorRT-LLM documentation. GPT Attention: MHA, MQA, GQA, paged KV cache, and INT8/FP8 KV caches.
