May 10, 2026 · Inference & Serving

When FP8 KV Cache Speeds Up Decode and When It Only Saves Memory

How FP8 KV cache affects real LLM serving latency: storage-only paths, fused dequantization, FP8 attention, decode break-even points, and calibration risk.


Abstract

FP8 KV cache is easy to oversell if the only number on the page is bytes per value. BF16 stores each key and value element in 2 bytes. FP8 stores each element in 1 byte. The raw cache-value bytes halve before scale metadata and paging overhead, but serving latency does not automatically halve.

The latency result depends on the attention path. Some runtimes store K/V in FP8 and dequantize on the way into an FP16/BF16 attention kernel. Some run the attention matmuls in the FP8 domain. Some add scale loads, query quantization, two-level accumulation, or per-layer exceptions. Those choices decide whether FP8 lowers inter-token latency, only improves admission control, or slows down prefill.

For the memory and scale-budget groundwork, see the earlier KV quantization guide. For the architectural reason the cache size can change even before quantization, see the earlier GQA and KV-cache bill. The missing piece is the runtime question: when does FP8 actually reduce decode time?

1. FP8 KV Cache Has More Than One Runtime Shape

Two deployments can both say "FP8 KV cache" and mean different execution paths.

[Figure: Three runtime paths for FP8 KV cache]

The simplest version is a storage optimization. The cache lives in FP8, so more tokens fit in memory. When attention runs, the runtime converts cached values back to a higher-precision path. That can still be valuable because OOM pressure drops and the scheduler can keep more sequences resident. But the latency gain depends on whether dequantization is fused into the attention kernel.
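
As a rough sense of what the residency gain looks like, here is a back-of-envelope sketch of per-token KV bytes. The layer count, KV-head count, and head dimension are assumed Llama-3.1-8B-like values for illustration, not numbers from any runtime; substitute your model's config.

# Back-of-envelope KV-cache bytes per token; shapes are assumed
# Llama-3.1-8B-like values (32 layers, 8 KV heads, head_dim 128).
layers, kv_heads, head_dim = 32, 8, 128

def kv_bytes_per_token(bytes_per_elem):
    # 2x for keys and values, before scale metadata and paging overhead.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

print(kv_bytes_per_token(2), kv_bytes_per_token(1))  # BF16 vs FP8: 131072 vs 65536

At a fixed cache budget, halving those bytes roughly doubles how many tokens can stay resident, which is the admission-control gain even when latency does not move.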

TensorRT-LLM documents this kind of path for GPT attention. It supports INT8 and FP8 KV caches. The attention operator stores quantized cache values using a scaling factor, then dequantizes cache reads on the fly inside the MHA/MQA kernel. In the documented implementation, the kv_cache_scaling_factor tensor has shape [1], so the scale is per tensor rather than per head.

vLLM's FlashAttention 3 path exposes a stronger version. Its FP8 KV-cache mode stores K/V in FP8 and, with the FlashAttention 3 backend, also performs attention operations in the quantized domain. The vLLM documentation notes that queries are quantized to FP8 in addition to keys and values in that configuration. The April 2026 vLLM FP8 KV-cache write-up describes the QK and ScoreV matrix multiplications running in FP8 with E4M3.
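
The difference between those paths can be sketched in a few lines of numpy. FP8 is simulated here with a uniform 8-bit grid and a per-tensor scale (numpy has no FP8 dtype, and real E4M3 spacing is non-uniform); the point is only where scales are applied, not how either runtime's kernel works.

import numpy as np

rng = np.random.default_rng(0)
d, T = 128, 1024                                   # head_dim, cached tokens
q = rng.standard_normal(d).astype(np.float32)      # query for one decode step
k = rng.standard_normal((T, d)).astype(np.float32)

def fake_fp8(x):
    # Per-tensor scale, values rounded to a coarse signed grid (FP8 stand-in).
    scale = np.float32(np.abs(x).max() / 127.0)
    return np.clip(np.round(x / scale), -127, 127).astype(np.float32), scale

k_q, k_scale = fake_fp8(k)
q_q, q_scale = fake_fp8(q)

# Storage-only / fused dequant: cache values are rescaled, matmul stays high precision.
scores_dequant = (k_q * k_scale) @ q

# FP8-domain attention: the query is quantized too, and the matmul runs on
# quantized values with the scales folded into the output.
scores_fp8 = (k_q @ q_q) * (k_scale * q_scale)

print(np.max(np.abs(scores_dequant - scores_fp8)))  # small gap, from quantizing q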

Those differences give three engineering cases:

Runtime shape                   | Memory effect                               | Latency effect
FP8 storage only                | Fewer cache bytes.                          | Better residency, but no guaranteed per-token speedup.
FP8 storage plus fused dequant  | Fewer cache bytes and less memory traffic.  | Decode may improve if scale/dequant overhead is small.
FP8 cache plus FP8 attention    | Fewer cache bytes and an FP8 matmul path.   | Decode slope can improve, but accuracy and prefill depend on kernel details.

The phrase "FP8 KV cache" is therefore incomplete. The operational question is:

Which attention backend reads the cache, where are scales applied,
and does the kernel do QK / ScoreV in FP8 or after dequantization?

2. Decode Speed Is a Slope Problem

During decode, each new token attends over the stored context. If the prompt is longer, the next token reads more K/V entries. A useful first-order model is:

\text{ITL} = \alpha \cdot T_{\text{input}} + \beta

where $T_{\text{input}}$ is the number of cached input tokens, $\alpha$ is the input-length-dependent slope, and $\beta$ is fixed overhead.

FP8 helps decode when it lowers α\alpha enough to compensate for any increase in β\beta. That is why short prompts may not show much speedup while long prompts do.

The vLLM April 2026 benchmarking gives a concrete H100 example for Llama-3.1-8B using FlashAttention 3. In the reported single-request fit, BF16 KV had slope $4.37 \times 10^{-5}$ ms/token and intercept 6.44 ms. FP8 KV had slope $2.37 \times 10^{-5}$ ms/token and intercept 6.58 ms. The fixed cost rose by about 0.14 ms, but the slope fell to about 54% of BF16.

[Figure: FP8 decode latency wins after the break-even context length]

The break-even point is:

T_{\text{break-even}} = \frac{\beta_{\text{FP8}} - \beta_{\text{BF16}}}{\alpha_{\text{BF16}} - \alpha_{\text{FP8}}}

Using those reported values:

bf16_slope = 4.37e-5      # ms per cached input token
bf16_intercept = 6.44     # ms

fp8_slope = 2.37e-5       # ms per cached input token
fp8_intercept = 6.58      # ms

# Context length at which FP8's lower slope pays back its higher intercept.
break_even_tokens = (fp8_intercept - bf16_intercept) / (bf16_slope - fp8_slope)
print(round(break_even_tokens))

Expected output:

7000

That number is not universal. It belongs to a specific model, GPU, backend, vLLM version, and benchmark setup. The reusable lesson is the shape: FP8 decode speed is not a single percentage. It is a slope and intercept comparison.
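
Fitting that slope and intercept from your own serving logs is a one-liner once you have (input length, ITL) pairs. The measurements below are placeholders, not benchmark data.

import numpy as np

# Placeholder measurements: median ITL (ms) at several cached input lengths.
input_tokens = np.array([1_000, 4_000, 16_000, 64_000])
itl_ms       = np.array([6.5,   6.6,   7.1,    9.2])

alpha, beta = np.polyfit(input_tokens, itl_ms, deg=1)   # slope, intercept
print(f"slope {alpha:.2e} ms/token, intercept {beta:.2f} ms")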

For serving dashboards, track these separately:

Metric                        | Why it matters
ITL slope vs input length     | Shows whether each extra cached token became cheaper.
ITL intercept                 | Shows fixed overhead from quantization, scales, and kernel setup.
Output tokens/sec under load  | Shows whether the isolated slope win survives batching and scheduling.
Accepted concurrent tokens    | Shows the memory-residency gain from halving cache bytes.

In vLLM's reported concurrency-8 load test for Llama-3.1-8B with roughly 20k input tokens and 2k output tokens per request, FP8 improved output throughput by 14.9% and median ITL by 14.8%. That is much smaller than a raw 2x byte reduction, but still material because it appears in the part of serving that repeats for every generated token.

3. Prefill Has a Different Answer

Prefill processes the prompt. Decode generates tokens after the prompt is cached. They stress the system differently.

During prefill, attention over the input can be compute-heavy, tile-shape-sensitive, and affected by register pressure. FP8 can help if the kernel maps well to the hardware. It can also regress when the accuracy fixes needed for long contexts introduce overhead.

vLLM's FP8 work found exactly that kind of split. For head dimensions 64 and 128, the improved FP8 path can speed both prefill and decode in the validated cases. For larger head dimensions such as 256, decode still improved, but prefill could be slower because two-level accumulation increased register pressure.

The reason comes from long-context accumulation. In attention, the ScoreV operation accumulates over the context dimension. At 100k+ context lengths, accumulation precision can become a real accuracy issue. vLLM reported that an older Hopper FP8 FlashAttention 3 path dropped from a 91% BF16 baseline to 13% FP8 on a 128k needle-in-a-haystack task. A two-level accumulation strategy brought the FP8 result back near baseline, at 89%, but it also added register pressure.
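
A toy numpy version of the accumulation idea makes the failure mode concrete. Float16 stands in for the low-precision accumulator (numpy has no FP8), and the loop structure mimics block-local accumulation promoted to float32; it is an illustration of the idea, not the FlashAttention 3 kernel.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(131_072).astype(np.float16)   # stand-in for ~128k ScoreV terms

# One low-precision accumulator across the whole context (worst case).
one_level = np.float16(0.0)
for v in x:
    one_level = np.float16(one_level + v)

# Two-level: low precision inside short blocks, float32 across blocks.
two_level = np.float32(0.0)
for block in x.reshape(-1, 256):
    two_level += np.float32(block.sum(dtype=np.float16))

reference = x.astype(np.float32).sum()
print(float(one_level), float(two_level), float(reference))
# The single low-precision accumulator typically drifts visibly from the
# reference; the block-promoted sum stays close.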

That accuracy cliff is a useful warning for production teams. A kernel path can be fast and wrong, accurate and slower, or fast only after a backend fix. The dtype flag is not enough context.

Measure TTFT and ITL separately:

TTFT: prompt processing, cache write, first output token
ITL: repeated decode step after the cache exists

Then segment by prompt length. A chat workload with short prompts and long outputs can value ITL more than TTFT. A retrieval workload with huge prompts and short answers may care more about TTFT and long-context accuracy.
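
A minimal sketch of that separation from raw timestamps; the request time and per-token arrival times below are placeholders for whatever your serving logs record.

import statistics

t_request = 0.0                                        # seconds, request accepted
token_times = [0.215, 0.222, 0.229, 0.235, 0.242]      # arrival time of each output token

ttft = token_times[0] - t_request                                # prefill + first token
itls = [b - a for a, b in zip(token_times, token_times[1:])]     # per-step decode latency

print(f"TTFT {ttft * 1000:.1f} ms, median ITL {statistics.median(itls) * 1000:.1f} ms")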

4. Sliding-Window Layers Can Waste FP8 Overhead

FP8 KV cache is most attractive when a layer reads a large, growing cache. Sliding-window attention changes that shape. If a layer attends only to the last $W$ tokens, then its cache traffic is bounded by $W$, not the full prompt length.

That makes small sliding-window layers poor FP8 targets. They still pay scale handling, conversion, and kernel overhead, but they do not get much long-context memory-read reduction.
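
A quick way to see the asymmetry is to compare per-step KV read bytes for a global layer and a small-window layer at the same context length. The head count, head dimension, and window size below are assumptions for illustration.

kv_heads, head_dim = 8, 128

def kv_read_bytes(context_len, bytes_per_elem, window=None):
    # Keys and values actually read by one attention layer for one decode step.
    tokens_read = context_len if window is None else min(context_len, window)
    return 2 * kv_heads * head_dim * tokens_read * bytes_per_elem

ctx = 64_000
print(kv_read_bytes(ctx, 2), kv_read_bytes(ctx, 1))                          # global layer: FP8 halves a large number
print(kv_read_bytes(ctx, 2, window=128), kv_read_bytes(ctx, 1, window=128))  # small window: both are tiny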

vLLM's gpt-oss-20b example showed this clearly. The model has a hybrid attention pattern with global and sliding-window layers. In earlier measurements, full FP8 for that model had almost the same ITL slope as BF16, so the break-even context length was far outside normal use. After backend work, full FP8 improved, but the best result came from skipping the sliding-window layers and keeping them in BF16 while quantizing global layers.

The serving policy should be layer-aware:

global attention layers      -> strong FP8 candidate
large sliding-window layers  -> benchmark
small sliding-window layers  -> often keep BF16
sensitive layers             -> skip or calibrate more carefully

This is also why FP8 KV cache cannot be evaluated only at the model level. If the model mixes attention types, the average hides where the speedup comes from and where overhead is being wasted.

5. Calibration Decides Whether FP8 Is a Safe Default

FP8 E4M3 has more precision than E5M2 but less dynamic range. Scales decide which activation ranges survive quantization.

vLLM documents three scale paths:

Scale path                 | What happens                                                                     | When it is useful
No calibration             | Scales default to 1.0.                                                           | Fastest reproducibility check and lower-bound accuracy test.
Random warmup calibration  | Scales are estimated from a random-token warmup batch.                          | Better than default scales when dataset calibration is unavailable.
Dataset calibration        | Scales are produced through an offline calibration path such as llm-compressor. | Best candidate for production accuracy, especially with per-head scales.
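
A hedged sketch of how those paths map onto vLLM engine arguments. The names kv_cache_dtype and calculate_kv_scales reflect recent vLLM releases and are assumptions to verify against the docs for the version you deploy.

from vllm import LLM

# 1. No calibration: FP8 KV cache with default scales of 1.0.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_dtype="fp8")

# 2. Warmup calibration: scales estimated at startup (assumed flag name).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          kv_cache_dtype="fp8",
          calculate_kv_scales=True)

# 3. Dataset calibration: load a checkpoint whose KV scales were produced
#    offline with llm-compressor, then enable the FP8 cache the same way.
llm = LLM(model="path/to/calibrated-checkpoint", kv_cache_dtype="fp8")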

Granularity matters. Per-tensor scaling applies one scale to the tensor. Per-attention-head scaling gives each query or KV head its own scale. vLLM documents per-attention-head FP8 KV-cache scaling for the FlashAttention backend through the llm-compressor calibration pathway. TensorRT-LLM's documented GPT attention path, by contrast, uses a per-tensor scaling factor for FP8 KV cache.

The practical impact is simple: per-tensor scaling is cheaper and easier, but one outlier can stretch the scale for many values. Per-head scaling can preserve more local range, but it requires backend support and calibration artifacts.
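
A toy numpy illustration of that trade-off: one outlier head stretches a per-tensor scale for every head, while per-head scales keep the well-behaved heads tight. The shapes and the uniform 8-bit grid are stand-ins, not an FP8 implementation.

import numpy as np

rng = np.random.default_rng(0)
k = rng.standard_normal((8, 128)).astype(np.float32)   # 8 KV heads, head_dim 128
k[3] *= 40.0                                            # one outlier head

def mean_quant_error(x, scale):
    q = np.clip(np.round(x / scale), -127, 127)
    return float(np.abs(q * scale - x).mean())

per_tensor_scale = np.abs(k).max() / 127.0
per_head_scale   = np.abs(k).max(axis=1, keepdims=True) / 127.0

# Error on a well-behaved head under each scheme: per-tensor is dominated by the outlier.
print(mean_quant_error(k[0], per_tensor_scale), mean_quant_error(k[0], per_head_scale[0]))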

A minimal experiment matrix:

BF16 KV baseline
FP8 KV, default scale = 1.0
FP8 KV, warmup scales
FP8 KV, dataset calibrated scales
FP8 KV, per-head scales where supported
FP8 KV with sensitive or sliding-window layers skipped

Run that matrix on the workloads that matter: long-context retrieval, code generation, tool-call formatting, long reasoning traces, and short-chat throughput. Average accuracy alone is too weak. The tasks that fail first are usually the tasks with narrow margins: exact copy, delimiter placement, reference selection among similar passages, and older instruction recall.

6. A Rollout Grid for FP8 KV Cache

The right FP8 decision depends on the workload, not the dtype.

[Figure: Decision grid for FP8 KV cache deployment]

Use FP8 first when all of these are true:

  • Decode is a major latency or cost driver.
  • Context lengths are long enough for the lower ITL slope to matter.
  • The attention backend has a validated FP8 path for the model shape.
  • Head dimension is 64 or 128, or the backend has proven results for larger heads.
  • Accuracy checks include long-context and exactness-sensitive slices.

Treat FP8 as memory-only until proven otherwise when:

  • The runtime stores FP8 but dequantizes into a higher-precision attention path.
  • The workload is dominated by short prompts.
  • Queueing, sampling, network time, or tool calls dominate end-to-end latency.
  • The scheduler benefits mainly from admitting more live tokens.

Skip layers or keep BF16 when:

  • Sliding-window layers have small windows.
  • Specific layers are known to be sensitive.
  • The model uses large head_dim and prefill latency matters.
  • Long-context accuracy regresses under the FP8 backend.
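
Encoded as a toy per-layer policy, the grid looks like the sketch below. The window threshold and head-dimension list are placeholder assumptions to replace with your own benchmark results, not validated cutoffs.

def kv_cache_dtype_for_layer(is_sliding_window, window, is_sensitive, head_dim):
    """Toy per-layer FP8 policy; thresholds are placeholders to tune, not recommendations."""
    if is_sensitive:
        return "bf16"            # skip or calibrate more carefully
    if is_sliding_window and window is not None and window <= 1024:
        return "bf16"            # small window: overhead without the long-context win
    if head_dim not in (64, 128):
        return "benchmark"       # larger heads: watch for prefill regressions
    return "fp8"                 # global or large-window layers

print(kv_cache_dtype_for_layer(True, 128, False, 128))    # -> bf16
print(kv_cache_dtype_for_layer(False, None, False, 128))  # -> fp8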

The rollout checklist should fit on one screen:

1. Confirm backend path: storage-only, dequant, or FP8 attention.
2. Measure TTFT and ITL separately.
3. Fit ITL slope against input length.
4. Compare throughput under realistic concurrency.
5. Run calibrated and uncalibrated scale variants.
6. Test layer skipping for hybrid attention models.
7. Validate exactness-sensitive and long-context slices.
8. Keep a BF16 fallback for high-risk traffic.

The headline trade-off is not "FP8 is faster" or "FP8 is risky." A better production rule is:

FP8 KV cache always changes memory.
It changes decode latency only when the attention path can spend fewer bytes without adding more overhead.
It preserves quality only when accumulation, scale granularity, and calibration match the workload.

That turns FP8 from a dtype toggle into an engineering decision.

References

  1. Jonas Kubler, Eldar Kurtic, Lucas Wilkinson, Matthew Bonanni, Michael Goin, Alexandre Marques, and Kailash Budhathoki. The State of FP8 KV-Cache and Attention Quantization in vLLM. vLLM Blog, April 22, 2026.
  2. vLLM documentation. Quantized KV Cache.
  3. NVIDIA TensorRT-LLM documentation. GPT Attention: MHA, MQA, GQA, paged KV cache, and INT8/FP8 KV caches.
  4. Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. 2024.
  5. Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization. ICML 2025.