Abstract
KV cache quantization looks simple if you only count bits. FP16 uses 2 bytes per value. FP8 uses 1 byte. INT4 packs two values into 1 byte. That arithmetic is real, but it is not the whole engineering problem. The cache is read at every decode step, the keys shape the attention distribution, the values shape the content that flows forward, and the scales used during quantization decide whether small but important signals survive.
The hard cases show up when context grows, decode becomes memory-bound, and traffic mixes short chats with long retrieval sessions. FP8 and INT4 can buy more live-token capacity, but the price is paid through scale granularity, key/value asymmetry, recent-token precision, and workload-specific quality checks.
1. Start With the Memory Bill
For a decoder-only transformer, every generated token appends a key vector and a value vector for every attention layer. The raw KV cache footprint per token is

$$\text{bytes/token} = 2 \cdot L \cdot H_{kv} \cdot d_{head} \cdot b$$

where $L$ is the number of layers, $H_{kv}$ is the number of KV heads, $d_{head}$ is the head dimension, and $b$ is bytes per stored value.
Take a Llama-3-style 8B shape:

- $L = 32$ layers
- $H_{kv} = 8$ KV heads with $d_{head} = 128$
- one key and one value per token per layer

The number of cached values per token is

$$2 \cdot 32 \cdot 8 \cdot 128 = 65{,}536$$
That gives the following approximate cache sizes:
| Cache format | Effective bytes/value | KiB/token | 32k tokens | 128k tokens |
|---|---|---|---|---|
| FP16/BF16 | 2.000 | 128.0 | 4.0 GiB | 16.0 GiB |
| FP8 | 1.000 | 64.0 | 2.0 GiB | 8.0 GiB |
| INT4 raw | 0.500 | 32.0 | 1.0 GiB | 4.0 GiB |
| INT4 with one FP16 scale per 64 values | 0.531 | 34.0 | 1.06 GiB | 4.25 GiB |
The last row is the important one. INT4 is not free 4x compression unless scale metadata, alignment, page layout, and kernel behavior are negligible. With group size 64 and one FP16 scale per group, the scale overhead is

$$\frac{2 \text{ bytes}}{64 \text{ values}} = 0.03125 \text{ bytes/value}$$

so the effective storage is $0.5 + 0.03125 = 0.53125$ bytes/value before any extra metadata.
A small calculator makes the shape of the problem clearer:
```python
def kv_cache_gib(
    layers: int,
    kv_heads: int,
    head_dim: int,
    tokens: int,
    bytes_per_value: float,
) -> float:
    # Factor of 2: one key vector and one value vector per token per layer.
    values_per_token = 2 * layers * kv_heads * head_dim
    total_bytes = values_per_token * tokens * bytes_per_value
    return total_bytes / (1024 ** 3)

layers = 32
kv_heads = 8
head_dim = 128
for tokens in [8192, 32768, 131072]:
    fp16 = kv_cache_gib(layers, kv_heads, head_dim, tokens, 2.0)
    fp8 = kv_cache_gib(layers, kv_heads, head_dim, tokens, 1.0)
    int4_group64 = kv_cache_gib(layers, kv_heads, head_dim, tokens, 0.53125)
    print(tokens, round(fp16, 2), round(fp8, 2), round(int4_group64, 2))
```

Expected output:

```
8192 1.0 0.5 0.27
32768 4.0 2.0 1.06
131072 16.0 8.0 4.25
```

This is why the cache becomes the constraint before the model weights do. A quantized 8B model may fit easily, but a long-context workload can still exhaust memory because the cache grows with active tokens.
2. FP8 Is a Scale Policy, Not Just a Smaller Float
FP8 KV cache usually means storing K and V in an 8-bit floating-point format and recovering approximate values during attention. The simplified operation is

$$\hat{x} = s \cdot \mathrm{fp8}\!\left(\frac{x}{s}\right)$$

The scale $s$ is the part that decides whether this works. If $s$ is too small, large values saturate. If $s$ is too large, small values collapse into a small number of representable buckets.
Modern serving stacks expose this in different ways. vLLM supports FP8 KV cache through `kv_cache_dtype="fp8"` and provides three scale paths: no calibration with scales set to 1.0, random-token warmup calibration with `calculate_kv_scales=True`, and dataset calibration through llm-compressor. It also documents both per-tensor and per-attention-head FP8 schemes, with per-head scaling currently tied to the FlashAttention calibration path. TensorRT-LLM supports INT8 and FP8 KV caches; its GPT attention documentation describes a per-tensor `kv_cache_scaling_factor` for KV cache quantization. The exact kernel path matters: in vLLM's FlashAttention 3 path, FP8 KV cache can also put attention operations into the FP8 domain, so the precision decision is not only a storage decision.
The simplest FP8 policy is per-tensor scaling:
```
one scale for all K values in a layer
one scale for all V values in a layer
```

That is cheap. It is also blunt. A single outlier can stretch the range for the entire tensor.
Per-head scaling is more precise:
```
one K scale per KV head
one V scale per KV head
```

This costs more metadata and needs kernel support, but it better matches the fact that heads can have different activation ranges. In vLLM's current documentation, per-head FP8 KV scaling is available only through the FlashAttention backend with the llm-compressor calibration pathway.
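To see why granularity matters, here is a minimal roundtrip sketch in PyTorch, assuming a build with `torch.float8_e4m3fn` support (PyTorch 2.1+); the tensor shapes and the synthetic wide-range head are illustrative, not a real model's statistics:

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3fn

def fp8_roundtrip(x: torch.Tensor, per_head: bool) -> torch.Tensor:
    # x: [kv_heads, tokens, head_dim]. Quantize to FP8 at a chosen scale
    # granularity, then dequantize back to the original dtype.
    if per_head:
        amax = x.abs().amax(dim=(1, 2), keepdim=True)  # one range per head
    else:
        amax = x.abs().amax()                          # one range for the tensor
    scale = (amax.float() / E4M3_MAX).clamp(min=1e-12)
    x_fp8 = (x.float() / scale).to(torch.float8_e4m3fn)
    return (x_fp8.float() * scale).to(x.dtype)

k = torch.randn(8, 1024, 128, dtype=torch.float16)
k[3] *= 20.0  # one head with a much wider activation range than the others
for per_head in (False, True):
    err = (fp8_roundtrip(k, per_head).float() - k.float()).abs().mean()
    print(f"per_head={per_head}  mean abs error: {err.item():.5f}")
```

With one wide-range head, the per-tensor scale stretches to fit it and the other seven heads lose resolution; per-head scaling keeps each head's error local.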
Here is the practical rule: FP8 is usually the first compressed KV format to try because it halves cache memory while keeping the numerical behavior close enough for many workloads. But "FP8" without scale calibration is not a complete configuration. The production question is:
```
FP8 with which format, which scale granularity, which calibration set, and which attention backend?
```

3. INT4 Is Where Grouping Becomes the Main Event
INT4 quantization stores each value as one of 16 codes. A common symmetric version uses signed levels from -7 to 7:

$$q = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right), -7, 7\right), \qquad \hat{x} = s \cdot q$$

The scale is typically chosen per group:

$$s = \frac{\max_{i \in \text{group}} |x_i|}{7}$$

This is where INT4 becomes less forgiving than FP8. With only 15 signed levels, one outlier can flatten the rest of the group.
Consider this group of eight key-cache values:
```
[-7.80, -0.18, -0.09, 0.02, 0.13, 0.20, 0.31, 0.44]
```

Using one symmetric INT4 scale for the whole group:

$$s = \frac{7.80}{7} \approx 1.114$$

The quantized values become approximately:
| Original | Original / s | INT4 code | Reconstructed |
|---|---|---|---|
| -7.80 | -7.00 | -7 | -7.80 |
| -0.18 | -0.16 | 0 | 0.00 |
| -0.09 | -0.08 | 0 | 0.00 |
| 0.02 | 0.02 | 0 | 0.00 |
| 0.13 | 0.12 | 0 | 0.00 |
| 0.20 | 0.18 | 0 | 0.00 |
| 0.31 | 0.28 | 0 | 0.00 |
| 0.44 | 0.39 | 0 | 0.00 |
The outlier is preserved. Almost everything else is erased.
Now remove the outlier and quantize the small values together:
```
[-0.18, -0.09, 0.02, 0.13, 0.20, 0.31, 0.44]
```

The scale becomes

$$s = \frac{0.44}{7} \approx 0.0629$$

and the same small values now spread across useful codes:
| Original | INT4 code | Reconstructed |
|---|---|---|
| -0.18 | -3 | -0.19 |
| -0.09 | -1 | -0.06 |
| 0.02 | 0 | 0.00 |
| 0.13 | 2 | 0.13 |
| 0.20 | 3 | 0.19 |
| 0.31 | 5 | 0.31 |
| 0.44 | 7 | 0.44 |
This toy example captures most INT4 KV cache engineering. The hard part is not "can we pack 4-bit values?" The hard part is choosing groups that keep the scale local enough to preserve signal while staying regular enough for fused attention kernels.
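A few lines of Python reproduce both tables, as a toy sketch of the symmetric group quantizer described above, not a kernel-ready implementation:

```python
def int4_symmetric_roundtrip(group: list[float]) -> list[float]:
    # One symmetric scale per group, signed levels -7..7.
    # |v / s| <= 7 by construction here, so no explicit clamp is needed.
    s = max(abs(v) for v in group) / 7.0
    return [round(v / s) * s for v in group]

with_outlier = [-7.80, -0.18, -0.09, 0.02, 0.13, 0.20, 0.31, 0.44]
print([f"{v:.2f}" for v in int4_symmetric_roundtrip(with_outlier)])
# ['-7.80', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00']
print([f"{v:.2f}" for v in int4_symmetric_roundtrip(with_outlier[1:])])
# ['-0.19', '-0.06', '0.00', '0.13', '0.19', '0.31', '0.44']
```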
4. Keys and Values Do Different Jobs
The key cache and value cache have different jobs.
Keys determine attention weights:

$$a_i = \mathrm{softmax}_i\!\left(\frac{q \cdot k_i}{\sqrt{d_{head}}}\right)$$

Values carry the content that gets mixed:

$$o = \sum_i a_i v_i$$
If a value vector is noisy, the model receives a distorted content vector. If a key vector is noisy, the model may attend to the wrong token entirely.
The difference is easiest to see with a small logit-margin example. Suppose two cached tokens are close competitors:
```
q = [1.00, 0.20]
k_a = [1.00, 0.00]
k_b = [0.96, 0.08]
```

Ignoring the shared $1/\sqrt{d_{head}}$ factor:
```
q dot k_a = 1.000
q dot k_b = 0.976
margin = 0.024
```

The model barely prefers token a. If INT4 quantization perturbs the keys like this:
```
k_a' = [0.98, 0.00]
k_b' = [0.99, 0.08]
```

then
```
q dot k_a' = 0.980
q dot k_b' = 1.006
```

The preference flips. A tiny key error changes the attention destination.
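Pushing the same numbers through a softmax shows the attention mass actually moving; this is a self-contained check of the worked example, with the $1/\sqrt{d}$ factor omitted as above:

```python
import math

def attn_weights(q: list[float], keys: list[list[float]]) -> list[float]:
    # Softmax over raw q.k logits (shared 1/sqrt(d) factor omitted,
    # matching the worked example above).
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

q = [1.00, 0.20]
print(attn_weights(q, [[1.00, 0.00], [0.96, 0.08]]))  # ~[0.506, 0.494]: token a wins
print(attn_weights(q, [[0.98, 0.00], [0.99, 0.08]]))  # ~[0.494, 0.506]: token b wins
```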
This is why several KV quantization methods treat keys and values asymmetrically. KIVI, for example, argues for per-channel key quantization and per-token value quantization based on observed KV cache distributions. Recent system-aware INT4 work makes the same broader point from the serving side: the quantizer has to respect both accuracy and fused-kernel constraints.
The operational lesson is simple:
```
Do not evaluate "INT4 KV cache" as one thing.
Evaluate key policy and value policy separately.
```

For runtimes that expose separate key and value precision policies, the safer progression is:
- FP8 K and FP8 V with calibrated scales.
- FP8 K and INT4 V.
- INT4 K and INT4 V only after long-context and task-specific evals pass.
5. The Recent Token Window Deserves Special Treatment
Not every token in the cache has the same sensitivity. The most recent tokens often dominate local coherence: syntax, formatting, tool-call JSON, closing brackets, variable names, and immediate reasoning state.
That suggests a residual-cache policy:
```
Keep the last R tokens in FP16/BF16 or FP8.
Quantize older tokens more aggressively.
```

For example:
| Region | Example policy | Reason |
|---|---|---|
| Last 256 tokens | FP16 or FP8 | Preserve local syntax and tool-call structure. |
| 256 to 8192 tokens back | FP8 | Good default balance for active context. |
| Older than 8192 tokens | INT4 values, FP8 keys | Save memory where exact local detail is less likely to matter. |
The exact thresholds are workload-dependent. A code-generation assistant may need a larger high-precision tail because small syntax errors are expensive. A summarization workload may tolerate more aggressive value quantization. A retrieval-heavy question-answering workload may need higher precision for keys across distant evidence spans because the model must still find the right paragraph among distractors.
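As a sketch, the table above can be expressed as a lookup on token age; the function name and format labels here are hypothetical, and real runtimes expose this differently, if at all:

```python
def kv_precision_policy(age_tokens: int) -> tuple[str, str]:
    # Returns (key_format, value_format) for a cached token whose age is
    # measured in tokens behind the current decode position.
    # Thresholds are the illustrative ones from the table above.
    if age_tokens <= 256:
        return ("fp16", "fp16")  # recent window: preserve local syntax
    if age_tokens <= 8192:
        return ("fp8", "fp8")    # active context
    return ("fp8", "int4")       # distant context: cheaper values, safer keys
```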
Hugging Face's cache documentation makes a related practical warning: quantized cache can hurt latency when context is short and GPU memory is sufficient. That is not a contradiction. Quantization helps when memory movement or capacity is the bottleneck. It can hurt when the extra quantize/dequantize path costs more than the memory savings return.
6. A Concrete Serving Budget
Assume a single 80 GiB GPU serving an 8B-class model. After weights, CUDA graphs, buffers, fragmentation reserve, and runtime overhead, suppose 45 GiB is available for KV cache.
Using the earlier Llama-3-style shape:
| Policy | Approx KiB/token | 45 GiB nominal live-token budget |
|---|---|---|
| FP16/BF16 KV | 128.0 | 368,640 tokens |
| FP8 KV | 64.0 | 737,280 tokens |
| INT4 group-64 KV | 34.0 | 1,387,821 tokens |
Now apply a 70% safety ceiling for allocator slack, burst handling, prefix sharing overhead, and p99 protection:
| Policy | Safe live-token budget at 70% |
|---|---|
| FP16/BF16 KV | 258,048 tokens |
| FP8 KV | 516,096 tokens |
| INT4 group-64 KV | 971,475 tokens |
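Both tables follow from one line of arithmetic; a quick sanity check in Python:

```python
def live_token_budget(kv_gib: float, kib_per_token: float, safety: float = 1.0) -> int:
    # KV budget in GiB converted to KiB, divided by per-token cost,
    # derated by an optional safety ceiling.
    return round(kv_gib * 1024 * 1024 / kib_per_token * safety)

for name, kib in [("fp16", 128.0), ("fp8", 64.0), ("int4_g64", 34.0)]:
    print(name, live_token_budget(45.0, kib), live_token_budget(45.0, kib, 0.7))
# fp16 368640 258048
# fp8 737280 516096
# int4_g64 1387821 971475
```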
This is the real reason INT4 remains attractive even when FP8 is safer. It can change admission control. If a product has many long sessions, the difference between 516k and 971k safe live tokens can be the difference between rejecting requests and staying online.
But the table hides two costs:
- INT4 needs stronger scale metadata and packing logic.
- INT4 needs better evals because the quality risk is larger and less uniform.
The right question is not "does INT4 save memory?" It does. The question is:
```
Which request classes can safely buy that memory with quantization error?
```

7. What to Benchmark Before You Trust It
A KV quantization benchmark should not be a single average score. It should be stratified by context length, task type, and output-risk class.
Minimum context bins:
```
2k, 8k, 32k, 128k
```

Minimum workload classes:
| Workload | Why it matters |
|---|---|
| Copy-sensitive code or JSON | Small errors can break execution. |
| Long-context retrieval QA | Distant weak signals must remain attendable. |
| Multi-turn chat | Recent state and old context both matter. |
| Summarization | Often more tolerant of value noise. |
| Tool calling | Syntax and argument fidelity matter more than generic fluency. |
Minimum system metrics:
```
tokens/sec
time to first token
inter-token latency p50/p95/p99
max admitted live tokens
KV reserved bytes vs live bytes
allocator block waste
OOM or eviction count
```

Minimum quality metrics:
```
exact match for structured outputs
pass rate for code/tests
answer F1 or judge score for QA
citation hit rate for RAG
top-k logit overlap against FP16 baseline
KL divergence against FP16 baseline
tool-call parse success rate
```

The baseline comparison should be explicit:
```
FP16/BF16 KV baseline
FP8 KV with no calibration
FP8 KV with calibrated scales
INT4 V only, FP8 K
INT4 K and V
```

If possible, record per-token logit divergence. A simplified offline harness can look like this:
```python
def topk_overlap(base_ids: list[int], test_ids: list[int]) -> float:
    # Fraction of the baseline's top-k token ids that survive quantization.
    return len(set(base_ids) & set(test_ids)) / len(base_ids)

def should_flag_step(
    kl_divergence: float,
    topk_overlap_at_10: float,
    exact_required: bool,
) -> bool:
    # Flag a decode step whose distribution drifted too far from the baseline.
    if exact_required and topk_overlap_at_10 < 0.8:
        return True
    if kl_divergence > 0.15:
        return True
    return False
```

The thresholds above are placeholders, not universal constants. The point is to treat KV precision as a runtime behavior change, not a storage-only optimization.
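The harness takes a per-step KL divergence as an input. A minimal way to compute it from stored baseline and quantized logits, written in pure Python for clarity (in practice you would batch this over positions with torch):

```python
import math

def kl_from_logits(base_logits: list[float], test_logits: list[float]) -> float:
    # KL(P_base || P_test) for one decode step's vocabulary logits.
    def softmax(xs: list[float]) -> list[float]:
        m = max(xs)  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    p = softmax(base_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```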
8. Example Configurations to Test
For a vLLM offline experiment, start with calibrated FP8 rather than default uncalibrated scales:
```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=512)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_cache_dtype="fp8",
    calculate_kv_scales=True,
)

prompts = [
    "Summarize the following long document...",
    "Return valid JSON for this tool call...",
]
outputs = llm.generate(prompts, sampling_params)
```

For a Hugging Face Transformers experiment with INT4 cache, test quanto or hqq cache backends in isolation from weight quantization:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("Write a JSON object with three fields:", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=128,
    cache_implementation="quantized",
    cache_config={"nbits": 4, "backend": "quanto"},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For TensorRT-LLM, treat KV cache quantization as part of engine construction and quality validation, not a runtime toggle you enable blindly. A typical FP8 path uses an FP8 quantization configuration plus an FP8 KV cache setting. The important part is the validation gate: compare output quality after engine build, not just throughput.
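A hedged sketch of that path using the TensorRT-LLM LLM API, assuming the `QuantConfig`/`QuantAlgo` interface documented for recent releases; import paths and field names vary across versions, so verify against the docs for your installed version:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# FP8 weights/activations plus an FP8 KV cache, applied at engine build time.
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8,
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quant_config=quant_config,
)
# Validation gate: run the same eval prompts against an unquantized build
# and compare quality, not just throughput, before promoting this engine.
```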
9. Decision Rules That Hold Up in Practice
Use FP16/BF16 KV when:
- Context is short.
- GPU memory is not the bottleneck.
- Outputs are high-stakes and exactness matters.
- The serving stack lacks fused quantized attention kernels.
Use FP8 KV when:
- Long-context or high-concurrency traffic is memory-bound.
- The runtime supports stable FP8 KV kernels.
- You can calibrate scales or at least run warmup-scale estimation.
- Quality regressions are small across your actual task mix.
Use INT4 values with higher-precision keys when the serving runtime supports separate key/value policy and:
- FP8 does not create enough admission headroom.
- The workload is tolerant to mild value distortion.
- You can keep a recent high-precision window.
- You can monitor structured-output and retrieval regressions.
Use INT4 keys only when:
- You have per-channel or system-aware key quantization.
- Long-context evals pass at the target context length.
- Tool calls, exact copy tasks, and code outputs have separate pass-rate checks.
- There is an automatic rollback path.
Avoid global precision flags for heterogeneous products. A better policy is:
```
default: FP8 KV
structured tool calls: FP8 KV with larger high-precision tail
summarization: INT4 V + FP8 K after eval
long-context legal/medical/finance QA: FP8 or FP16 fallback
capacity emergency: downshift old values first, then old keys only if monitors stay clean
```

10. The Regressions to Watch
KV quantization regressions usually do not look like random nonsense. They look like subtle capability erosion.
Common signs:
- The model still answers fluently but misses a fact from the middle of a long context.
- JSON is almost valid but has a wrong delimiter or missing quote.
- Code compiles less often even though prose explanations look normal.
- Tool arguments drift after long reasoning traces.
- The model over-focuses on recent context and ignores older evidence.
- Multi-turn conversations lose commitments made many turns earlier.
These are exactly the regressions hidden by short prompts and generic preference judging. A useful eval set should include adversarial long-context cases where the answer depends on a low-salience token far from the end of the prompt.
One simple test pattern:
```
Place 20 similar records in a 32k-token prompt.
Only one record contains the exact requested value.
Ask for that value in strict JSON.
Compare FP16, FP8, and INT4 outputs.
Move the record across positions: early, middle, late.
Repeat with distractors that share names, dates, or IDs.
```

This catches both attention-routing drift and structured-output fragility.
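A sketch of a generator for that pattern; the names, field layout, and record format are hypothetical, and in practice you would pad the record list with filler text to reach the target 32k context length:

```python
import random

def build_needle_case(num_records: int = 20, needle_pos: int = 10, seed: int = 0) -> tuple[str, str]:
    # One record holds the requested value; the rest are near-duplicate distractors.
    rng = random.Random(seed)
    records, answer = [], ""
    for i in range(num_records):
        invoice = f"INV-{rng.randint(10000, 99999)}"
        if i == needle_pos:
            name, answer = "Acme Industrial", invoice   # the needle
        else:
            name = f"Acme Industries {i}"               # similar-name distractor
        records.append(f"Record {i}: customer={name}, invoice={invoice}")
    question = 'Return strict JSON {"invoice": "..."} for customer "Acme Industrial".'
    return "\n".join(records) + "\n\n" + question, answer

prompt, gold = build_needle_case()
# Run prompt under FP16, FP8, and INT4 KV policies at each needle position;
# score exact JSON match against gold.
```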
Conclusion
KV cache quantization is not just a memory compression trick. It is an attention-policy decision. FP8 usually gives the best first step because it cuts cache memory roughly in half while keeping the engineering surface manageable. INT4 is the capacity lever, but it only works reliably when grouping, scale metadata, key/value asymmetry, residual windows, and fused kernels are designed together.
The safest production path is progressive: start with FP8 and calibrated scales, compare against an FP16/BF16 cache baseline, add INT4 values before INT4 keys, keep recent tokens at higher precision, and make rollback depend on both latency and quality metrics. The teams that get this right treat KV precision as a per-workload control surface, not a single checkbox.
References
- vLLM documentation. *Quantized KV Cache*. https://docs.vllm.ai/en/stable/features/quantization/quantized_kvcache/
- vLLM Team. *The State of FP8 KV-Cache and Attention Quantization in vLLM*. https://vllm.ai/blog/fp8-kvcache
- Hugging Face Transformers documentation. *Cache strategies*. https://huggingface.co/docs/transformers/en/kv_cache
- NVIDIA TensorRT-LLM documentation. *Multi-Head, Multi-Query, and Group-Query Attention*. https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html
- NVIDIA TensorRT-LLM documentation. *FP8 Quantization*. https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/fp8-quantization.html
- Liu, Z. et al. *KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache*. arXiv:2402.02750. https://arxiv.org/abs/2402.02750
- Jia, J. et al. *SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving*. arXiv:2604.19157. https://arxiv.org/abs/2604.19157
