Abstract
KV cache quantization looks simple if you only count bits. FP16 uses 2 bytes per value. FP8 uses 1 byte. INT4 packs two values into 1 byte. That arithmetic is real, but it is not the whole engineering problem. The cache is read at every decode step, the keys shape the attention distribution, the values shape the content that flows forward, and the scales used during quantization decide whether small but important signals survive.
The hard cases show up when context grows, decode becomes memory-bound, and traffic mixes short chats with long retrieval sessions. FP8 and INT4 can buy more live-token capacity, but the price is paid through scale granularity, key/value asymmetry, recent-token precision, and workload-specific quality checks.
1. Start With the Memory Bill
For a decoder-only transformer, every generated token appends a key vector and a value vector for every attention layer. The raw KV cache footprint per token is

$$\text{bytes/token} = 2 \cdot L \cdot H_{kv} \cdot d_{head} \cdot b$$

where $L$ is the number of layers, $H_{kv}$ is the number of KV heads, $d_{head}$ is the head dimension, and $b$ is bytes per stored value.
Take a Llama-3-style 8B shape:

- $L = 32$ layers
- $H_{kv} = 8$ KV heads with $d_{head} = 128$
- one key and one value per token per layer

The number of cached values per token is

$$2 \cdot 32 \cdot 8 \cdot 128 = 65{,}536$$
That gives the following approximate cache sizes:
| Cache format | Effective bytes/value | KiB/token | 32k tokens | 128k tokens |
|---|---|---|---|---|
| FP16/BF16 | 2.000 | 128.0 | 4.0 GiB | 16.0 GiB |
| FP8 | 1.000 | 64.0 | 2.0 GiB | 8.0 GiB |
| INT4 raw | 0.500 | 32.0 | 1.0 GiB | 4.0 GiB |
| INT4 with one FP16 scale per 64 values | 0.531 | 34.0 | 1.06 GiB | 4.25 GiB |
The last row is the important one. INT4 is not free 4x compression unless scale metadata, alignment, page layout, and kernel behavior are negligible. With group size 64 and one FP16 scale per group, the scale overhead is

$$\frac{2 \text{ bytes}}{64 \text{ values}} = 0.03125 \text{ bytes/value}$$

so the effective storage is $0.5 + 0.03125 = 0.53125$ bytes/value before any extra metadata.
A small calculator makes the shape of the problem clearer:
```python
def kv_cache_gib(
    layers: int,
    kv_heads: int,
    head_dim: int,
    tokens: int,
    bytes_per_value: float,
) -> float:
    # Factor of 2: one key vector and one value vector per token per layer.
    values_per_token = 2 * layers * kv_heads * head_dim
    total_bytes = values_per_token * tokens * bytes_per_value
    return total_bytes / (1024 ** 3)

layers = 32
kv_heads = 8
head_dim = 128
for tokens in [8192, 32768, 131072]:
    fp16 = kv_cache_gib(layers, kv_heads, head_dim, tokens, 2.0)
    fp8 = kv_cache_gib(layers, kv_heads, head_dim, tokens, 1.0)
    int4_group64 = kv_cache_gib(layers, kv_heads, head_dim, tokens, 0.53125)
    print(tokens, round(fp16, 2), round(fp8, 2), round(int4_group64, 2))
```

Expected output:

```
8192 1.0 0.5 0.27
32768 4.0 2.0 1.06
131072 16.0 8.0 4.25
```

This is why the cache becomes the constraint before the model weights do. A quantized 8B model may fit easily, but a long-context workload can still exhaust memory because the cache grows with active tokens.
2. FP8 Is a Scale Policy, Not Just a Smaller Float
FP8 KV cache usually means storing K and V in an 8-bit floating-point format and recovering approximate values during attention. The simplified operation is

$$\hat{x} = s \cdot \mathrm{fp8}\!\left(\frac{x}{s}\right)$$

The scale $s$ is the part that decides whether this works. If $s$ is too small, large values saturate. If $s$ is too large, small values collapse into a small number of representable buckets.
Modern serving stacks expose this in different ways. vLLM supports FP8 KV cache through `kv_cache_dtype="fp8"` and provides three scale paths: no calibration with scales set to 1.0, random-token warmup calibration with `calculate_kv_scales=True`, and dataset calibration through llm-compressor. It also documents both per-tensor and per-attention-head FP8 schemes, with per-head scaling currently tied to the FlashAttention calibration path. TensorRT-LLM supports INT8 and FP8 KV caches; its GPT attention documentation describes a per-tensor `kv_cache_scaling_factor` for KV cache quantization. The exact kernel path matters: in vLLM's FlashAttention 3 path, FP8 KV cache can also put attention operations into the FP8 domain, so the precision decision is not only a storage decision.
The simplest FP8 policy is per-tensor scaling:
```
one scale for all K values in a layer
one scale for all V values in a layer
```

That is cheap. It is also blunt. A single outlier can stretch the range for the entire tensor.
Per-head scaling is more precise:
```
one K scale per KV head
one V scale per KV head
```

This costs more metadata and needs kernel support, but it better matches the fact that heads can have different activation ranges. In vLLM's current documentation, per-head FP8 KV scaling is available only through the FlashAttention backend with the llm-compressor calibration pathway.
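To see why granularity matters, here is a minimal roundtrip sketch in PyTorch, assuming a build with `torch.float8_e4m3fn` support (PyTorch 2.1+); the tensor shapes and the synthetic wide-range head are illustrative, not a real model's statistics:

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def fp8_roundtrip(x: torch.Tensor, per_head: bool) -> torch.Tensor:
    # x: [kv_heads, tokens, head_dim]. Quantize to FP8 at a chosen scale
    # granularity, then dequantize back to the original dtype.
    if per_head:
        amax = x.abs().amax(dim=(1, 2), keepdim=True)  # one range per head
    else:
        amax = x.abs().amax()                          # one range for the tensor
    scale = (amax.float() / E4M3_MAX).clamp(min=1e-12)
    x_fp8 = (x.float() / scale).to(torch.float8_e4m3fn)
    return (x_fp8.float() * scale).to(x.dtype)

k = torch.randn(8, 1024, 128, dtype=torch.float16)
k[3] *= 20.0  # one head with a much wider activation range than the others
for per_head in (False, True):
    err = (fp8_roundtrip(k, per_head).float() - k.float()).abs().mean()
    print(f"per_head={per_head}  mean abs error: {err.item():.5f}")
```

With one wide-range head, the per-tensor scale stretches to fit it and the other seven heads lose resolution; per-head scaling keeps each head's error local.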
Here is the practical rule: FP8 is usually the first compressed KV format to try because it halves cache memory while keeping the numerical behavior close enough for many workloads. But "FP8" without scale calibration is not a complete configuration. The production question is:
```
FP8 with which format, which scale granularity, which calibration set, and which attention backend?
```

3. INT4 Is Where Grouping Becomes the Main Event
INT4 quantization stores each value as one of 16 codes. A common symmetric version uses signed levels from -7 to 7:

$$q = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right), -7, 7\right), \qquad \hat{x} = s \cdot q$$

The scale is typically chosen per group:

$$s = \frac{\max_{i \in \text{group}} |x_i|}{7}$$

This is where INT4 becomes less forgiving than FP8. With only 15 signed levels, one outlier can flatten the rest of the group.
Consider this group of eight key-cache values:
```
[-7.80, -0.18, -0.09, 0.02, 0.13, 0.20, 0.31, 0.44]
```

Using one symmetric INT4 scale for the whole group:

$$s = \frac{7.80}{7} \approx 1.114$$

The quantized values become approximately:
| Original | Original / s | INT4 code | Reconstructed |
|---|---|---|---|
| -7.80 | -7.00 | -7 | -7.80 |
| -0.18 | -0.16 | 0 | 0.00 |
| -0.09 | -0.08 | 0 | 0.00 |
| 0.02 | 0.02 | 0 | 0.00 |
| 0.13 | 0.12 | 0 | 0.00 |
| 0.20 | 0.18 | 0 | 0.00 |
| 0.31 | 0.28 | 0 | 0.00 |
| 0.44 | 0.39 | 0 | 0.00 |
The outlier is preserved. Almost everything else is erased.
Now remove the outlier and quantize the small values together:
```
[-0.18, -0.09, 0.02, 0.13, 0.20, 0.31, 0.44]
```

The scale becomes

$$s = \frac{0.44}{7} \approx 0.0629$$

and the same small values now spread across useful codes:
| Original | INT4 code | Reconstructed |
|---|---|---|
| -0.18 | -3 | -0.19 |
| -0.09 | -1 | -0.06 |
| 0.02 | 0 | 0.00 |
| 0.13 | 2 | 0.13 |
| 0.20 | 3 | 0.19 |
| 0.31 | 5 | 0.31 |
| 0.44 | 7 | 0.44 |
This toy example captures most INT4 KV cache engineering. The hard part is not "can we pack 4-bit values?" The hard part is choosing groups that keep the scale local enough to preserve signal while staying regular enough for fused attention kernels.
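A few lines of Python reproduce both tables, as a toy sketch of the symmetric group quantizer described above, not a kernel-ready implementation:

```python
def int4_symmetric_roundtrip(group: list[float]) -> list[float]:
    # One symmetric scale per group, signed levels -7..7.
    # |v / s| <= 7 by construction here, so no explicit clamp is needed.
    s = max(abs(v) for v in group) / 7.0
    return [round(v / s) * s for v in group]

with_outlier = [-7.80, -0.18, -0.09, 0.02, 0.13, 0.20, 0.31, 0.44]
print([f"{v:.2f}" for v in int4_symmetric_roundtrip(with_outlier)])
# ['-7.80', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00']
print([f"{v:.2f}" for v in int4_symmetric_roundtrip(with_outlier[1:])])
# ['-0.19', '-0.06', '0.00', '0.13', '0.19', '0.31', '0.44']
```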
4. Keys and Values Do Different Jobs
The key cache and value cache have different jobs.
Keys determine attention weights:

$$a_i = \mathrm{softmax}_i\!\left(\frac{q \cdot k_i}{\sqrt{d_{head}}}\right)$$

Values carry the content that gets mixed:

$$o = \sum_i a_i v_i$$
If a value vector is noisy, the model receives a distorted content vector. If a key vector is noisy, the model may attend to the wrong token entirely.
The difference is easiest to see with a small logit-margin example. Suppose two cached tokens are close competitors:
```
q = [1.00, 0.20]
k_a = [1.00, 0.00]
k_b = [0.96, 0.08]
```

Ignoring the shared $1/\sqrt{d_{head}}$ factor:
```
q dot k_a = 1.000
q dot k_b = 0.976
margin = 0.024
```

The model barely prefers token a. If INT4 quantization perturbs the keys like this:
```
k_a' = [0.98, 0.00]
k_b' = [0.99, 0.08]
```

then
```
q dot k_a' = 0.980
q dot k_b' = 1.006
```

The preference flips. A tiny key error changes the attention destination.
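Pushing the same numbers through a softmax shows the attention mass actually moving; this is a self-contained check of the worked example, with the $1/\sqrt{d}$ factor omitted as above:

```python
import math

def attn_weights(q: list[float], keys: list[list[float]]) -> list[float]:
    # Softmax over raw q.k logits (shared 1/sqrt(d) factor omitted,
    # matching the worked example above).
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

q = [1.00, 0.20]
print(attn_weights(q, [[1.00, 0.00], [0.96, 0.08]]))  # ~[0.506, 0.494]: token a wins
print(attn_weights(q, [[0.98, 0.00], [0.99, 0.08]]))  # ~[0.494, 0.506]: token b wins
```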
This is why several KV quantization methods treat keys and values asymmetrically. KIVI, for example, argues for per-channel key quantization and per-token value quantization based on observed KV cache distributions. Recent system-aware INT4 work makes the same broader point from the serving side: the quantizer has to respect both accuracy and fused-kernel constraints.
The operational lesson is simple:
```
Do not evaluate "INT4 KV cache" as one thing.
Evaluate key policy and value policy separately.
```

For runtimes that expose separate key and value precision policies, the safer progression is:
- FP8 K and FP8 V with calibrated scales.
- FP8 K and INT4 V.
- INT4 K and INT4 V only after long-context and task-specific evals pass.
5. The Recent Token Window Deserves Special Treatment
Not every token in the cache has the same sensitivity. The most recent tokens often dominate local coherence: syntax, formatting, tool-call JSON, closing brackets, variable names, and immediate reasoning state.
That suggests a residual-cache policy:
```
Keep the last R tokens in FP16/BF16 or FP8.
Quantize older tokens more aggressively.
```

For example:
| Region | Example policy | Reason |
|---|---|---|
| Last 256 tokens | FP16 or FP8 | Preserve local syntax and tool-call structure. |
| 256 to 8192 tokens back | FP8 | Good default balance for active context. |
| Older than 8192 tokens | INT4 values, FP8 keys | Save memory where exact local detail is less likely to matter. |
The exact thresholds are workload-dependent. A code-generation assistant may need a larger high-precision tail because small syntax errors are expensive. A summarization workload may tolerate more aggressive value quantization. A retrieval-heavy question-answering workload may need higher precision for keys across distant evidence spans because the model must still find the right paragraph among distractors.
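As a sketch, the table above can be expressed as a lookup on token age; the function name and format labels here are hypothetical, and real runtimes expose this differently, if at all:

```python
def kv_precision_policy(age_tokens: int) -> tuple[str, str]:
    # Returns (key_format, value_format) for a cached token whose age is
    # measured in tokens behind the current decode position.
    # Thresholds are the illustrative ones from the table above.
    if age_tokens <= 256:
        return ("fp16", "fp16")  # recent window: preserve local syntax
    if age_tokens <= 8192:
        return ("fp8", "fp8")    # active context
    return ("fp8", "int4")       # distant context: cheaper values, safer keys
```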
Hugging Face's cache documentation makes a related practical warning: quantized cache can hurt latency when context is short and GPU memory is sufficient. That is not a contradiction. Quantization helps when memory movement or capacity is the bottleneck. It can hurt when the extra quantize/dequantize path costs more than the memory savings return.
6. A Concrete Serving Budget
Assume a single 80 GiB GPU serving an 8B-class model. After weights, CUDA graphs, buffers, fragmentation reserve, and runtime overhead, suppose 45 GiB is available for KV cache.
Using the earlier Llama-3-style shape:
| Policy | Approx KiB/token | 45 GiB nominal live-token budget |
|---|---|---|
| FP16/BF16 KV | 128.0 | 368,640 tokens |
| FP8 KV | 64.0 | 737,280 tokens |
| INT4 group-64 KV | 34.0 | 1,387,821 tokens |
Now apply a 70% safety ceiling for allocator slack, burst handling, prefix sharing overhead, and p99 protection:
| Policy | Safe live-token budget at 70% |
|---|---|
| FP16/BF16 KV | 258,048 tokens |
| FP8 KV | 516,096 tokens |
| INT4 group-64 KV | 971,475 tokens |
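Both tables follow from one line of arithmetic; a quick sanity check in Python:

```python
def live_token_budget(kv_gib: float, kib_per_token: float, safety: float = 1.0) -> int:
    # KV budget in GiB converted to KiB, divided by per-token cost,
    # derated by an optional safety ceiling.
    return round(kv_gib * 1024 * 1024 / kib_per_token * safety)

for name, kib in [("fp16", 128.0), ("fp8", 64.0), ("int4_g64", 34.0)]:
    print(name, live_token_budget(45.0, kib), live_token_budget(45.0, kib, 0.7))
# fp16 368640 258048
# fp8 737280 516096
# int4_g64 1387821 971475
```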
This is the real reason INT4 remains attractive even when FP8 is safer. It can change admission control. If a product has many long sessions, the difference between 516k and 971k safe live tokens can be the difference between rejecting requests and staying online.
But the table hides two costs:
- INT4 needs stronger scale metadata and packing logic.
- INT4 needs better evals because the quality risk is larger and less uniform.
The right question is not "does INT4 save memory?" It does. The question is:
```
Which request classes can safely buy that memory with quantization error?
```

7. What to Benchmark Before You Trust It
A KV quantization benchmark should not be a single average score. It should be stratified by context length, task type, and output-risk class.
Minimum context bins:
```
2k, 8k, 32k, 128k
```

Minimum workload classes:
| Workload | Why it matters |
|---|---|
| Copy-sensitive code or JSON | Small errors can break execution. |
| Long-context retrieval QA | Distant weak signals must remain attendable. |
| Multi-turn chat | Recent state and old context both matter. |
| Summarization | Often more tolerant of value noise. |
| Tool calling | Syntax and argument fidelity matter more than generic fluency. |
Minimum system metrics:
```
tokens/sec
time to first token
inter-token latency p50/p95/p99
max admitted live tokens
KV reserved bytes vs live bytes
allocator block waste
OOM or eviction count
```

Minimum quality metrics:
```
exact match for structured outputs
pass rate for code/tests
answer F1 or judge score for QA
citation hit rate for RAG
top-k logit overlap against FP16 baseline
KL divergence against FP16 baseline
tool-call parse success rate
```

The baseline comparison should be explicit:
```
FP16/BF16 KV baseline
FP8 KV with no calibration
FP8 KV with calibrated scales
INT4 V only, FP8 K
INT4 K and V
```

If possible, record per-token logit divergence. A simplified offline harness can look like this:
```python
def topk_overlap(base_ids: list[int], test_ids: list[int]) -> float:
    # Fraction of the baseline's top-k token ids that survive quantization.
    return len(set(base_ids) & set(test_ids)) / len(base_ids)

def should_flag_step(
    kl_divergence: float,
    topk_overlap_at_10: float,
    exact_required: bool,
) -> bool:
    # Flag a decode step whose distribution drifted too far from the baseline.
    if exact_required and topk_overlap_at_10 < 0.8:
        return True
    if kl_divergence > 0.15:
        return True
    return False
```

The thresholds above are placeholders, not universal constants. The point is to treat KV precision as a runtime behavior change, not a storage-only optimization.
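The harness takes a per-step KL divergence as an input. A minimal way to compute it from stored baseline and quantized logits, written in pure Python for clarity (in practice you would batch this over positions with torch):

```python
import math

def kl_from_logits(base_logits: list[float], test_logits: list[float]) -> float:
    # KL(P_base || P_test) for one decode step's vocabulary logits.
    def softmax(xs: list[float]) -> list[float]:
        m = max(xs)  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    p = softmax(base_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```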
8. Example Configurations to Test
For a vLLM offline experiment, start with calibrated FP8 rather than default uncalibrated scales:
```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=512)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_cache_dtype="fp8",
    calculate_kv_scales=True,
)

prompts = [
    "Summarize the following long document...",
    "Return valid JSON for this tool call...",
]
outputs = llm.generate(prompts, sampling_params)
```

For a Hugging Face Transformers experiment with INT4 cache, test quanto or hqq cache backends in isolation from weight quantization:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("Write a JSON object with three fields:", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=128,
    cache_implementation="quantized",
    cache_config={"nbits": 4, "backend": "quanto"},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For TensorRT-LLM, treat KV cache quantization as part of engine construction and quality validation, not a runtime toggle you enable blindly. A typical FP8 path uses an FP8 quantization configuration plus an FP8 KV cache setting. The important part is the validation gate: compare output quality after engine build, not just throughput.
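A hedged sketch of that path using the TensorRT-LLM LLM API, assuming the `QuantConfig`/`QuantAlgo` interface documented for recent releases; import paths and field names vary across versions, so verify against the docs for your installed version:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# FP8 weights/activations plus an FP8 KV cache, applied at engine build time.
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8,
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quant_config=quant_config,
)
# Validation gate: run the same eval prompts against an unquantized build
# and compare quality, not just throughput, before promoting this engine.
```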
9. Decision Rules That Hold Up in Practice
Use FP16/BF16 KV when:
- Context is short.
- GPU memory is not the bottleneck.
- Outputs are high-stakes and exactness matters.
- The serving stack lacks fused quantized attention kernels.
Use FP8 KV when:
- Long-context or high-concurrency traffic is memory-bound.
- The runtime supports stable FP8 KV kernels.
- You can calibrate scales or at least run warmup-scale estimation.
- Quality regressions are small across your actual task mix.
Use INT4 values with higher-precision keys when the serving runtime supports separate key/value policy and:
- FP8 does not create enough admission headroom.
- The workload is tolerant to mild value distortion.
- You can keep a recent high-precision window.
- You can monitor structured-output and retrieval regressions.
Use INT4 keys only when:
- You have per-channel or system-aware key quantization.
- Long-context evals pass at the target context length.
- Tool calls, exact copy tasks, and code outputs have separate pass-rate checks.
- There is an automatic rollback path.
Avoid global precision flags for heterogeneous products. A better policy is:
```
default: FP8 KV
structured tool calls: FP8 KV with larger high-precision tail
summarization: INT4 V + FP8 K after eval
long-context legal/medical/finance QA: FP8 or FP16 fallback
capacity emergency: downshift old values first, then old keys only if monitors stay clean
```

10. The Regressions to Watch
KV quantization regressions usually do not look like random nonsense. They look like subtle capability erosion.
Common signs:
- The model still answers fluently but misses a fact from the middle of a long context.
- JSON is almost valid but has a wrong delimiter or missing quote.
- Code compiles less often even though prose explanations look normal.
- Tool arguments drift after long reasoning traces.
- The model over-focuses on recent context and ignores older evidence.
- Multi-turn conversations lose commitments made many turns earlier.
These are exactly the regressions hidden by short prompts and generic preference judging. A useful eval set should include adversarial long-context cases where the answer depends on a low-salience token far from the end of the prompt.
One simple test pattern:
```
Place 20 similar records in a 32k-token prompt.
Only one record contains the exact requested value.
Ask for that value in strict JSON.
Compare FP16, FP8, and INT4 outputs.
Move the record across positions: early, middle, late.
Repeat with distractors that share names, dates, or IDs.
```

This catches both attention-routing drift and structured-output fragility.
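A sketch of a generator for that pattern; the names, field layout, and record format are hypothetical, and in practice you would pad the record list with filler text to reach the target 32k context length:

```python
import random

def build_needle_case(num_records: int = 20, needle_pos: int = 10, seed: int = 0) -> tuple[str, str]:
    # One record holds the requested value; the rest are near-duplicate distractors.
    rng = random.Random(seed)
    records, answer = [], ""
    for i in range(num_records):
        invoice = f"INV-{rng.randint(10000, 99999)}"
        if i == needle_pos:
            name, answer = "Acme Industrial", invoice   # the needle
        else:
            name = f"Acme Industries {i}"               # similar-name distractor
        records.append(f"Record {i}: customer={name}, invoice={invoice}")
    question = 'Return strict JSON {"invoice": "..."} for customer "Acme Industrial".'
    return "\n".join(records) + "\n\n" + question, answer

prompt, gold = build_needle_case()
# Run prompt under FP16, FP8, and INT4 KV policies at each needle position;
# score exact JSON match against gold.
```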
Conclusion
KV cache quantization is not just a memory compression trick. It is an attention-policy decision. FP8 usually gives the best first step because it cuts cache memory roughly in half while keeping the engineering surface manageable. INT4 is the capacity lever, but it only works reliably when grouping, scale metadata, key/value asymmetry, residual windows, and fused kernels are designed together.
The safest production path is progressive: start with FP8 and calibrated scales, compare against an FP16/BF16 cache baseline, add INT4 values before INT4 keys, keep recent tokens at higher precision, and make rollback depend on both latency and quality metrics. The teams that get this right treat KV precision as a per-workload control surface, not a single checkbox.
References
- vLLM documentation. *Quantized KV Cache*. https://docs.vllm.ai/en/stable/features/quantization/quantized_kvcache/
- vLLM Team. *The State of FP8 KV-Cache and Attention Quantization in vLLM*. https://vllm.ai/blog/fp8-kvcache
- Hugging Face Transformers documentation. *Cache strategies*. https://huggingface.co/docs/transformers/en/kv_cache
- NVIDIA TensorRT-LLM documentation. *Multi-Head, Multi-Query, and Group-Query Attention*. https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html
- NVIDIA TensorRT-LLM documentation. *FP8 Quantization*. https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/fp8-quantization.html
- Liu, Z. et al. *KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache*. arXiv:2402.02750. https://arxiv.org/abs/2402.02750
- Jia, J. et al. *SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving*. arXiv:2604.19157. https://arxiv.org/abs/2604.19157
