Feb 26, 2026 · LLM Architecture

Attention in Practice: Visualizing Q/K/V and why scaling heads changes behavior

A walkthrough of scaled dot-product attention (Q/K/V), softmax temperature, and why increasing head count shifts attention statistics and behavior.


Abstract

Self-attention has become the core primitive for modern sequence models, yet much of the intuition around queries, keys, and values (Q/K/V) remains abstract. This post provides a view of attention that makes the geometry explicit and connects scaling behavior to head count. We derive the standard scaled dot-product attention, inspect its softmax temperature, analyze variance under random projections, and show how the number of heads changes the distribution of similarity scores and the effective rank of the attention map. We then connect these effects to practical phenomena—token mixing, feature specialization, gradient stability, and compute–quality tradeoffs. The goal is to provide a working mental model: attention is a learned kernel machine whose statistics shift with head scaling, and those shifts alter both what patterns are found and how confidently they are used.

Intuition

Attention can be seen as learned nearest-neighbor search in a content-based embedding space. For each token position \(t\), the model produces:

  • a query \(q_t\) representing “what I’m looking for,”
  • a key \(k_s\) representing “what I contain,”
  • a value \(v_s\) representing “what I’ll provide.”

The attention output at position \(t\) is a weighted average of values \(\{v_s\}\), where the weights are determined by the similarity between \(q_t\) and \(k_s\). In practice, the similarity is a dot product or a scaled dot product. Because the softmax is exponential, small changes in dot products can cause large shifts in weight allocation. Thus, the statistical distribution of dot products is not an implementation detail; it shapes attention behavior.
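To make this lookup concrete, here is a minimal NumPy sketch (the vectors and the `attend` helper are illustrative, not from the post): one query is scored against every key, the scores pass through a softmax, and the output is the resulting weighted average of the values.

```python
import numpy as np

def attend(q, K, V):
    """Single-query attention: similarity scores -> softmax weights -> weighted average of values."""
    scores = K @ q                     # dot-product similarity of the query with every key
    w = np.exp(scores - scores.max())  # numerically stable softmax
    w = w / w.sum()
    return w @ V                       # weighted average of the value vectors

# Toy example: the second key aligns with the query, so its value dominates the output.
q = np.array([1.0, 0.0])
K = np.array([[0.0, 1.0],    # orthogonal to q -> low score
              [5.0, 0.0],    # aligned with q  -> high score
              [0.0, -1.0]])
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
out = attend(q, K, V)        # close to V[1] = [0, 1]
```

Note how the exponential in the softmax turns a moderate score gap (5 vs. 0) into near-total weight on one value, which is exactly why score statistics matter.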

Multi-head attention is often introduced as “multiple subspaces.” But a more precise mental model is: each head imposes its own kernel, with its own statistics, on the sequence. As we scale the number of heads, we change not only the subspaces but also the per-head dimensionality, which changes the variance and concentration of dot-product scores. This shifts the sharpness of the attention distribution and alters how “confident” attention becomes.

In short: head scaling changes behavior because it changes the geometry and variance of similarity scores.

Mathematical Formulation

Consider an input sequence \(X \in \mathbb{R}^{T \times d_{\text{model}}}\) with \(T\) tokens. A single attention head uses linear projections:

\[ Q = XW_Q, \quad K = XW_K, \quad V = XW_V \]

where \(W_Q, W_K, W_V \in \mathbb{R}^{d_{\text{model}} \times d_k}\). Here \(d_k\) is the per-head dimensionality. The unnormalized attention scores are:

\[ S = QK^\top \in \mathbb{R}^{T \times T} \]

and scaled dot-product attention is:

\[ \text{Attn}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]
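A direct NumPy transcription of this formula might look like the following sketch (the shapes, seed, and `softmax` helper are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)  # (T, T) scaled scores
    A = softmax(S, axis=-1)     # each row is a distribution over source tokens
    return A @ V, A

rng = np.random.default_rng(0)
T, d_k = 5, 16
Q, K, V = rng.normal(size=(3, T, d_k))  # unpack three (T, d_k) matrices
Y, A = scaled_dot_product_attention(Q, K, V)
```

Each row of `A` sums to 1, so `Y` is a convex combination of the rows of `V`, matching the weighted-average reading from the intuition section.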

Why the scale?

Assume entries of \(Q\) and \(K\) are approximately i.i.d. with zero mean and unit variance. Then for a single pair of tokens:

\[ q_t \cdot k_s = \sum_{i=1}^{d_k} q_{t,i} k_{s,i} \]

If \(q_{t,i}\) and \(k_{s,i}\) are independent with variance 1, then:

\[ \mathbb{E}[q_t \cdot k_s] = 0, \quad \text{Var}(q_t \cdot k_s) = d_k \]

Thus the standard deviation of the dot products grows as \(\sqrt{d_k}\). Without scaling, larger \(d_k\) yields larger-magnitude scores and a sharper softmax, potentially saturating it and hurting gradient flow. Scaling by \(1/\sqrt{d_k}\) normalizes the variance:

\[ \text{Var}\!\left(\frac{q_t \cdot k_s}{\sqrt{d_k}}\right) = 1 \]

So the “temperature” of attention depends on \(d_k\).
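This variance argument is easy to check with a quick Monte Carlo simulation, assuming i.i.d. standard normal entries as above (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 50_000, 64

# i.i.d. zero-mean, unit-variance entries, as in the derivation
q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))

raw = (q * k).sum(axis=1)    # unscaled dot products: variance ~ d_k = 64
scaled = raw / np.sqrt(d_k)  # scaled as in the attention formula: variance ~ 1
```

The empirical variance of `raw` comes out near 64 and that of `scaled` near 1, confirming that the \(1/\sqrt{d_k}\) factor fixes the score scale independently of head dimension.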

Multi-head attention

Multi-head attention splits the model dimension \(d_{\text{model}}\) into \(H\) heads:

\[ d_k = d_v = \frac{d_{\text{model}}}{H} \]

Each head \(h\) produces:

\[ \text{head}_h = \text{softmax}\!\left(\frac{Q_h K_h^\top}{\sqrt{d_k}}\right)V_h \]

and the outputs are concatenated and projected:

\[ \text{MHA}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_H)\,W_O \]

Scaling the number of heads \(H\) implies shrinking the per-head dimension \(d_k\) if \(d_{\text{model}}\) is fixed. That is the core lever that changes behavior.
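Putting the pieces together, here is a minimal multi-head sketch in NumPy (the initialization and shapes are illustrative; real implementations add masking, dropout, and biases):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, H):
    """Split d_model into H heads of size d_k, attend per head, concatenate, project."""
    T, d_model = X.shape
    d_k = d_model // H
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                      # (T, d_model) each
    split = lambda M: M.reshape(T, H, d_k).transpose(1, 0, 2)  # -> (H, T, d_k)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k))   # (H, T, T) per-head maps
    heads = A @ Vh                                           # (H, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)    # Concat(head_1..head_H)
    return concat @ W_O

rng = np.random.default_rng(0)
T, d_model, H = 6, 32, 4
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V, W_O = rng.normal(size=(4, d_model, d_model)) / np.sqrt(d_model)
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, H)  # (T, d_model)
```

Note that `d_k = d_model // H` appears in two places: the reshape and the \(1/\sqrt{d_k}\) scale. Changing `H` therefore changes the score statistics of every head, which is the lever discussed next.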

Head Scaling Effects

Head scaling typically means increasing \(H\) while keeping \(d_{\text{model}}\) fixed. This reduces \(d_k\), which has several consequences.

1) Variance of dot products and softmax sharpness

If \(d_k\) shrinks, the variance of unscaled dot products shrinks:

\[ \text{Var}(q \cdot k) = d_k \]

After scaling by \(1/\sqrt{d_k}\), the variance is normalized to 1, but the distributional shape still changes. With small \(d_k\), dot products are sums of fewer terms and thus exhibit higher kurtosis and more variability in tail events (non-Gaussianity). Empirically, this leads to either:

  • more peaky distributions if a few dimensions align strongly, or
  • more uniform attention if projections are weak.

These two effects appear contradictory, but both can happen depending on the learned projections. In practice, smaller \(d_k\) yields higher variance across heads and more “specialized” behavior.

2) Effective rank and expressivity

The attention matrix for a head is:

\[ A_h = \text{softmax}\!\left(\frac{Q_h K_h^\top}{\sqrt{d_k}}\right) \]

Even if \(Q_h K_h^\top\) is full rank, the softmax tends to produce low-rank-like structure because it concentrates mass. As \(H\) increases, each head operates on a smaller projection, but there are more attention matrices. The combined output can be seen as a mixture of low-rank kernels. Increasing the head count increases the number of mixture components while decreasing each kernel’s embedding dimension.

This mirrors the tradeoff in mixture models: more components, less capacity per component.

3) Geometric interpretation

Consider the normalized similarity:

\[ \frac{q_t \cdot k_s}{\|q_t\|\,\|k_s\|} = \cos(\theta_{t,s}) \]

When \(d_k\) is large, random vectors concentrate around orthogonality, and cosine similarities are tightly distributed around 0. When \(d_k\) is small, cosine similarities are more spread out, making it easier for random alignments to produce high similarity. This affects the baseline probability of attending to unrelated tokens.

Thus, head scaling changes the signal-to-noise ratio of similarity measures.
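This concentration effect is easy to verify empirically: for independent Gaussian vectors in \(\mathbb{R}^{d_k}\), the cosine similarity has standard deviation close to \(1/\sqrt{d_k}\). A small sketch (sample sizes and dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_spread(d_k, n=50_000):
    """Std of cosine similarity between pairs of independent Gaussian vectors in R^{d_k}."""
    u = rng.normal(size=(n, d_k))
    v = rng.normal(size=(n, d_k))
    cos = (u * v).sum(axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    return cos.std()

spread_small = cosine_spread(8)    # roughly 1/sqrt(8)   ~ 0.35
spread_large = cosine_spread(128)  # roughly 1/sqrt(128) ~ 0.09
```

With \(d_k = 8\), accidental similarities of 0.3 or more are routine; with \(d_k = 128\) they are rare. That is the change in similarity signal-to-noise that head scaling induces.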

4) Gradient and optimization dynamics

Attention output is:

\[ Y = \text{softmax}(S)\,V \]

The gradient \(\partial Y/\partial S\) depends strongly on softmax sharpness. If attention is too sharp, gradients vanish for non-selected tokens; if too flat, gradients are diffuse and weak. Smaller \(d_k\) (with more heads) can push heads into either regime, depending on the learned scale. In practice, models often learn per-head scaling or rely on normalization layers such as RMSNorm or LayerNorm for stability.
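The saturation half of this tradeoff follows from the softmax Jacobian, \(\text{diag}(p) - pp^\top\), which collapses toward zero as the distribution approaches one-hot. A small illustration (the scores and temperatures are arbitrary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(p):
    """Jacobian of softmax with respect to its logits: diag(p) - p p^T."""
    return np.diag(p) - np.outer(p, p)

scores = np.array([2.0, 1.0, 0.0, -1.0])

p_sharp = softmax(scores / 0.1)  # low temperature: near one-hot (saturated)
p_soft = softmax(scores)         # unit temperature: spread-out weights

grad_sharp = np.linalg.norm(softmax_jacobian(p_sharp))
grad_soft = np.linalg.norm(softmax_jacobian(p_soft))
# grad_sharp is orders of magnitude smaller: a saturated softmax passes almost no gradient
```

The flat regime is not captured by this norm alone; there the issue is that gradient is spread thinly over all tokens, so no single key receives a strong learning signal.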

5) Why “scaling heads” changes behavior

Combining the above:

  • More heads → smaller \(d_k\) → more variance in similarity scores and more specialization
  • Fewer heads → larger \(d_k\) → more stable, averaged similarity distributions

Hence, scaling head count changes what attention “looks like” (more diverse patterns) and how it behaves during training (different gradient regimes).

Practical Implications

1) Specialization vs. generalization

Empirically, increasing head count tends to produce more specialized heads: some attend to syntax, some to positional patterns, others to semantic matching. But this comes at the cost of reduced per-head capacity. If \(d_k\) becomes too small, each head may be underpowered for capturing complex patterns. This explains why simply increasing heads does not always improve performance.

2) Head dimension as temperature control

Although the scaling factor \(1/\sqrt{d_k}\) is fixed, the learned projections can amplify or suppress the effective temperature. Yet the per-head dimension still matters because it sets the range of achievable dot products. Smaller heads can lead to brittle or noisy attention maps. If you observe overly diffuse or overly spiky attention, consider adjusting \(H\) or \(d_k\).

3) Distributional effects in large models

In large-scale models, attention maps are often analyzed in aggregate. But head scaling alters the distribution of attention entropy across heads. For instance, with many heads, entropy variance increases: some heads become extremely sparse while others are diffuse. This can be good (diversity) or bad (instability), depending on the task and training regime.
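One way to quantify this when inspecting real attention maps is per-row entropy. Below is a sketch of such a diagnostic (the uniform and one-hot matrices are synthetic reference points, not model outputs):

```python
import numpy as np

def attention_entropy(A, eps=1e-12):
    """Mean Shannon entropy (in nats) of the rows of an attention matrix.
    Ranges from 0 (every row one-hot) up to log(T) (every row uniform)."""
    return float(-(A * np.log(A + eps)).sum(axis=-1).mean())

T = 32
uniform = np.full((T, T), 1.0 / T)  # maximally diffuse attention
one_hot = np.eye(T)                 # maximally sparse attention

h_uniform = attention_entropy(uniform)  # log(32), the upper bound
h_one_hot = attention_entropy(one_hot)  # near 0, the lower bound
```

Applied per head, this statistic makes "some heads are extremely sparse while others are diffuse" a measurable claim: plot the entropy distribution across heads and watch how its spread changes with \(H\).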

4) Compute tradeoffs

Computational cost of attention is:

\[ O(T^2 d_{\text{model}}) \]

But per-head attention uses matrices of size \(T \times d_k\). Increasing head count increases the number of matrix multiplications but keeps the total dimension constant. In practice, kernel fusion and GPU efficiency often favor a moderate number of heads rather than extreme head counts.

5) Interpreting attention maps

If you visualize attention in a model with many heads, you’ll see many seemingly “noisy” heads. This is not necessarily failure; it is a consequence of smaller \(d_k\) and higher variance in similarity. When evaluating head behavior, consider both head count and head dimension.

Visual Aids

Vector geometry (query vs. keys)

[Figure: query and keys in 2D dot-product geometry]

Scaled dot-product distribution by head dimension

[Figure: distribution of scaled dot-product scores for several head dimensions]

Cosine similarity distribution vs. head dimension

[Figure: distribution of cosine similarities for several head dimensions]

Average attention entropy vs. number of heads (toy)

[Figure: average attention entropy as a function of head count]

Conclusion

Attention is not just a convenient architectural trick—it is a learned kernel whose statistics govern model behavior. Queries, keys, and values define the geometry, while scaling and head count determine the distribution of similarity scores. As head count increases, per-head dimension shrinks, altering dot-product variance, attention sharpness, specialization, and gradient dynamics. This is why scaling heads changes behavior.

For practitioners, the takeaway is clear: head count and head dimension are not cosmetic hyperparameters. They shape the attention landscape, impacting both interpretability and performance. Understanding these effects allows more informed scaling decisions and better diagnosis of model behavior in practice.

