QKV Attention Explained: Scaling Heads and Softmax Behavior

Abstract

Self-attention has become the core primitive for modern sequence models, yet much of the intuition around queries, keys, and values (Q/K/V) remains abstract. A useful mental model treats attention as a learned kernel: queries and keys define similarity, values carry content, and the softmax turns score geometry into token mixing. Scaled dot-product attention uses $1/\sqrt{d_k}$ to normalize score variance under common random-vector assumptions. Head count still changes behavior, but mainly through representation partitioning, learned projections, per-head capacity, optimization dynamics, and runtime kernel behavior.

Intuition

Attention can be seen as a learned, soft nearest-neighbor search in a content-based embedding space. For a query position $t$ and each source position $s$, the model produces:

a query $q_t$ representing “what I’m looking for,”
a key $k_s$ representing “what I contain,”
a value $v_s$ representing “what I’ll provide.”

The attention output at position $t$ is a weighted average of the values $\{v_s\}$, where the weights are determined by the similarity between $q_t$ and each $k_s$. In practice, the similarity is a dot product or a scaled dot product. Because the softmax exponentiates the scores, a small change in one dot product relative to the others can shift a large share of the weight. The statistical distribution of dot products is therefore not an implementation detail: it shapes attention behavior.

Multi-head attention is often introduced as “multiple subspaces.” A more precise mental model is: each head imposes its own kernel, with its own learned projections and statistics, on the sequence. As head count changes, the model changes how much representation capacity each head receives and how many distinct kernels can be mixed downstream.

In short: head scaling changes behavior because it changes the geometry, capacity allocation, and learned specialization of the similarity functions.

Mathematical Formulation

Consider an input sequence $X \in \mathbb{R}^{T \times d_{\text{model}}}$ with $T$ tokens. A single attention head uses linear projections:

Q = XW_Q,\quad K = XW_K,\quad V = XW_V

where $W_Q, W_K, W_V \in \mathbb{R}^{d_{\text{model}} \times d_k}$. Here $d_k$ is the per-head dimensionality. The unnormalized attention scores are:

S = QK^\top \in \mathbb{R}^{T \times T}

and scaled dot-product attention is:

\text{Attn}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
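The formula above can be sketched for a single head in NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`softmax`, `scaled_dot_product_attention`), the toy sizes, and the random projection matrices are all hypothetical, and masking, batching, and dropout are omitted.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """One head: Q and K are (T, d_k), V is (T, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T) scaled similarity scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights          # output rows are weighted averages of values

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
T, d_model, d_k = 5, 16, 4
X = rng.standard_normal((T, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                 for _ in range(3))
out, A = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, A.shape)
```

Each row of `A` sums to 1, so every output row is a convex combination of the value vectors.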

Why the scale?

Assume entries of $Q$ and $K$ are approximately i.i.d. with zero mean and unit variance. Then for a single pair of tokens:

q_t \cdot k_s = \sum_{i=1}^{d_k} q_{t,i}k_{s,i}

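If $q_{t,i}$ and $k_{s,i}$ are independent with zero mean and variance 1, each product $q_{t,i}k_{s,i}$ has mean 0 and variance 1, so summing $d_k$ independent terms gives: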

\mathbb{E}[q_t \cdot k_s] = 0, \quad \text{Var}(q_t \cdot k_s) = d_k

Thus the typical magnitude (standard deviation) of an unscaled dot product grows as $\sqrt{d_k}$. Without scaling, larger $d_k$ yields larger-magnitude scores and a sharper softmax, which can saturate and hurt gradient flow. Scaling by $1/\sqrt{d_k}$ normalizes the variance under these simplifying assumptions:

\text{Var}\!\left(\frac{q_t \cdot k_s}{\sqrt{d_k}}\right) = 1

So the original Transformer scale prevents head dimension from directly driving score variance under the i.i.d. model. Real trained heads can still have different effective temperatures because the learned projections, normalization layers, data distribution, and implementation details change the score distribution.
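The variance argument can be checked with a small Monte Carlo simulation. The sketch below assumes i.i.d. standard normal entries, which is the simplifying model used in the derivation and not a property of trained projections; the function name `dot_product_variance` and the sample count are illustrative choices.

```python
import numpy as np

def dot_product_variance(d_k, n_pairs=10000, seed=0):
    """Empirical variance of q.k before and after scaling by 1/sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_pairs, d_k))   # i.i.d. N(0, 1) entries (assumption)
    k = rng.standard_normal((n_pairs, d_k))
    dots = np.sum(q * k, axis=1)              # one dot product per (q, k) pair
    return float(dots.var()), float((dots / np.sqrt(d_k)).var())

for d_k in (8, 64, 256):
    raw, scaled = dot_product_variance(d_k)
    print(f"d_k={d_k:4d}  Var(q.k)={raw:8.2f}  Var(q.k/sqrt(d_k))={scaled:.3f}")
```

Under these assumptions the unscaled variance should track $d_k$ while the scaled variance stays near 1, up to sampling noise.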

Multi-head attention

Multi-head attention splits the model dimension $d_{\text{model}}$ into $H$ heads:

d_k = d_v = \frac{d_{\text{model}}}{H}

Each head $h$ produces:

\text{head}_h = \text{softmax}\!\left(\frac{Q_h K_h^\top}{\sqrt{d_k}}\right)V_h

and the outputs are concatenated and projected:

\text{MHA}(X) = \text{Concat}(\text{head}_1,\dots,\text{head}_H)W_O

Scaling the number of heads $H$ implies shrinking per-head dimension $d_k$ if $d_{\text{model}}$ is fixed. That changes capacity allocation even though the scaled dot-product formula normalizes the simplest variance term.
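A compact NumPy sketch of the multi-head computation follows. It assumes $d_k = d_v = d_{\text{model}}/H$ as above and uses full $d_{\text{model}} \times d_{\text{model}}$ projection matrices that are reshaped into heads, which is one common layout rather than the only one. The function name `multi_head_attention` and the random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, H):
    """X is (T, d_model); all W_* are (d_model, d_model). Returns (Y, A)."""
    T, d_model = X.shape
    assert d_model % H == 0, "d_model must be divisible by H"
    d_k = d_model // H

    def split_heads(M):
        # (T, d_model) -> (H, T, d_k): each head gets a contiguous slice of features
        return M.reshape(T, H, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (H, T, T)
    A = softmax(scores, axis=-1)                            # one attention matrix per head
    heads = A @ V                                           # (H, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)   # Concat(head_1, ..., head_H)
    return concat @ W_O, A

rng = np.random.default_rng(0)
T, d_model = 6, 32
X = rng.standard_normal((T, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
shapes = {}
for H in (2, 4, 8):
    Y, A = multi_head_attention(X, W_Q, W_K, W_V, W_O, H)
    shapes[H] = (Y.shape, A.shape)
    print(H, d_model // H, Y.shape, A.shape)
```

The output shape does not depend on `H`; only the number of attention matrices and the per-head width `d_k` change.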

Head Scaling Effects

Head scaling typically means increasing $H$ while keeping $d_{\text{model}}$ fixed. This reduces $d_k$, which has several consequences.

1) Score distribution and softmax sharpness

If $d_k$ shrinks, the variance of unscaled dot products shrinks:

\text{Var}(q \cdot k) = d_k

After scaling by $1/\sqrt{d_k}$, the variance under the i.i.d. model is normalized to 1, but the shape of the distribution can still change. With small $d_k$, each dot product is a sum of fewer terms, so the central-limit approximation is weaker and the tails are heavier; learned projections and structured inputs can amplify this. In practice, this can lead to either:

more peaky distributions if a few dimensions align strongly, or
more uniform attention if projections are weak.

These two effects can both happen depending on the learned projections. Smaller $d_k$ often yields more head-to-head diversity and more specialized behavior, but that is an empirical/training outcome rather than a direct consequence of the scale factor alone.
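The change in distributional shape can be illustrated with a toy simulation. Assuming i.i.d. standard normal entries (an idealization, not a trained model), each product $q_i k_i$ has excess kurtosis 6, so the scaled dot product has excess kurtosis $6/d_k$: unit variance at every $d_k$, but heavier tails when $d_k$ is small. The function name and sample sizes below are illustrative.

```python
import numpy as np

def excess_kurtosis_of_scaled_scores(d_k, n_pairs=50000, seed=0):
    """Estimate the excess kurtosis of q.k / sqrt(d_k) for i.i.d. N(0, 1) q and k."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_pairs, d_k))
    k = rng.standard_normal((n_pairs, d_k))
    s = np.sum(q * k, axis=1) / np.sqrt(d_k)   # scaled scores, variance close to 1
    z = (s - s.mean()) / s.std()               # standardize before taking moments
    return float(np.mean(z ** 4) - 3.0)        # 0 for a Gaussian

kurt = {d_k: excess_kurtosis_of_scaled_scores(d_k) for d_k in (2, 16, 64)}
for d_k, value in kurt.items():
    print(f"d_k={d_k:3d}  excess kurtosis ~ {value:.2f}  (i.i.d. theory: {6 / d_k:.2f})")
```

Heavier tails mean that occasional large scores are more likely, which is one way a small head can end up with a peaky softmax even though the variance is normalized.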

2) Effective rank and expressivity

The attention matrix for a head is:

A_h = \text{softmax}\!\left(\frac{Q_hK_h^\top}{\sqrt{d_k}}\right)

The pre-softmax score matrix $Q_hK_h^\top$ has rank at most $d_k$, which is below $T$ whenever $d_k < T$. The row-wise softmax is nonlinear, so $A_h$ itself can have higher rank, but it can still behave like a low-rank map when the softmax concentrates mass on a few keys. As $H$ increases, each head produces a smaller projection but there are more attention matrices. The combined output can be seen as a mixture of low-rank kernels. Increasing heads increases the mixture count while decreasing each kernel’s embedding dimension.

This mirrors the tradeoff in mixture models: more components, less capacity per component.
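To make the rank discussion concrete, the sketch below compares the numerical rank of a raw score matrix with that of the corresponding attention matrix. The score matrix is a product of $(T, d_k)$ factors, so its rank cannot exceed $d_k$; the row-wise softmax is nonlinear and generally changes the rank. Random Gaussian `Q` and `K` are an assumption made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_k = 32, 4                       # toy sizes with d_k < T
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))

S = Q @ K.T / np.sqrt(d_k)                      # (T, T) scores, rank at most d_k
E = np.exp(S - S.max(axis=-1, keepdims=True))   # stable row-wise softmax
A = E / E.sum(axis=-1, keepdims=True)

rank_S = int(np.linalg.matrix_rank(S))
rank_A = int(np.linalg.matrix_rank(A))
print(f"rank(S) = {rank_S}, rank(softmax(S)) = {rank_A}, T = {T}")
```

Numerical rank is a coarse measure; inspecting the singular value spectrum of `A` gives a better picture of how much of the map is concentrated in a few directions.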

3) Geometric interpretation

Consider the normalized similarity:

\frac{q_t \cdot k_s}{\|q_t\| \|k_s\|} = \cos(\theta_{t,s})

When $d_k$ is large, random vectors concentrate around orthogonality: cosine similarities are tightly distributed around 0, with standard deviation about $1/\sqrt{d_k}$ for isotropic random vectors. When $d_k$ is small, cosine similarities are more spread, making it easier for random alignments to produce high similarity. This affects the baseline probability of attention to unrelated tokens.

Thus, head scaling changes the signal-to-noise ratio of similarity measures.
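The concentration of cosine similarity can be checked numerically. The sketch assumes isotropic Gaussian vectors, for which the cosine between two independent vectors has mean 0 and standard deviation $1/\sqrt{d_k}$; trained queries and keys are not isotropic, so this is only a baseline. The name `cosine_std` is hypothetical.

```python
import numpy as np

def cosine_std(d_k, n_pairs=20000, seed=0):
    """Empirical standard deviation of cos(theta) between random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_pairs, d_k))
    k = rng.standard_normal((n_pairs, d_k))
    cos = np.sum(q * k, axis=1) / (
        np.linalg.norm(q, axis=1) * np.linalg.norm(k, axis=1)
    )
    return float(cos.std())

for d_k in (4, 16, 64):
    print(f"d_k={d_k:3d}  std(cos)={cosine_std(d_k):.3f}  1/sqrt(d_k)={d_k ** -0.5:.3f}")
```

With small `d_k` the spread is wide enough that unrelated vectors regularly reach sizeable cosine similarity by chance.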

4) Gradient and optimization dynamics

Attention output is:

Y = \text{softmax}(S)V

The gradient $\partial Y/\partial S$ depends strongly on softmax sharpness. If attention is too sharp, the softmax saturates and the gradient with respect to the scores becomes vanishingly small, especially for non-selected tokens; if it is too flat, the gradient is spread thinly across many tokens. Smaller $d_k$ (with more heads) can shift heads into either regime depending on the learned scale. In practice, the learned projections set an effective per-head scale, and normalization such as RMSNorm or LayerNorm is commonly used to stabilize training.
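The saturation effect can be seen directly in the softmax Jacobian, which for one row of scores with weights $p$ is $\mathrm{diag}(p) - pp^\top$. The sketch below evaluates it for one hypothetical score vector multiplied by different gains; the vector `base` and the gains are arbitrary illustrative values.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def softmax_jacobian(s):
    """Jacobian d softmax(s) / d s for a single score row: diag(p) - p p^T."""
    p = softmax(s)
    return np.diag(p) - np.outer(p, p)

base = np.array([2.0, 1.0, 0.5, -1.0])   # hypothetical scores for one query
norms = {}
for gain in (0.1, 1.0, 10.0):            # larger gain = sharper softmax
    J = softmax_jacobian(gain * base)
    norms[gain] = float(np.linalg.norm(J))
    print(f"gain={gain:5.1f}  max weight={softmax(gain * base).max():.4f}  "
          f"||J||_F={norms[gain]:.2e}")
```

At high gain the weights are nearly one-hot and the Jacobian collapses toward zero, so score gradients all but vanish. At low gain the Jacobian is not small, but the weights are nearly uniform and the signal is spread evenly across tokens.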

5) Why “scaling heads” changes behavior

Combining the above:

More heads → smaller $d_k$ → more kernels and less capacity per kernel
Fewer heads → larger $d_k$ → fewer kernels and more capacity per kernel

Hence, scaling head count changes what attention “looks like” (more diverse patterns) and how it behaves during training (different gradient regimes).

Practical Implications

1) Specialization vs. generalization

Empirically, increasing head count tends to produce more specialized heads: some attend to syntax, some to positional patterns, others to semantic matching. But this comes at the cost of reduced per-head capacity. If $d_k$ becomes too small, each head may be underpowered for capturing complex patterns. This is one reason why simply increasing heads does not always improve performance.

2) Head dimension as temperature control

Although the scaling $1/\sqrt{d_k}$ is fixed, the learned projections can amplify or suppress the effective temperature. Per-head dimension still matters because it sets the amount of feature capacity available to each kernel. Smaller heads can lead to brittle or noisy attention maps if the task requires richer comparisons than the head can represent. If you observe overly diffuse or overly spiky attention, consider adjusting $H$, $d_k$, normalization, or projection initialization.
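A short sketch of how projection magnitude acts as a temperature: multiplying $W_Q$ by a gain $g$ multiplies every score by $g$, which is equivalent to running the softmax at temperature $1/g$. The random weights, the sizes, and the helper `mean_row_entropy` are assumptions made for illustration.

```python
import numpy as np

def mean_row_entropy(scores):
    """Average entropy (nats) of the row-wise softmax of `scores`."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
T, d_model, d_k = 16, 64, 8
X = rng.standard_normal((T, d_model))
W_Q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
W_K = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)

gains = (0.5, 1.0, 2.0, 4.0)
entropies = []
for gain in gains:
    Q = X @ (gain * W_Q)   # scaling W_Q by `gain` scales every score by `gain`
    K = X @ W_K
    entropies.append(mean_row_entropy(Q @ K.T / np.sqrt(d_k)))
    print(f"gain={gain:3.1f}  mean entropy={entropies[-1]:.3f}  (max={np.log(T):.3f})")
```

Entropy falls as the gain grows, even though the $1/\sqrt{d_k}$ factor never changes, which is why initialization and normalization of the projections affect attention sharpness.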

3) Distributional effects in large models

In large-scale models, attention maps are often analyzed in aggregate. But head scaling alters the distribution of attention entropy across heads. For instance, with many heads, the spread of entropy across heads often widens: some heads become extremely sparse while others are diffuse. This can be good (diversity) or bad (instability), depending on the task and training regime.
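Per-head attention entropy is straightforward to compute from a stack of attention matrices. The sketch below assumes an array `A` of shape `(H, T, T)` whose rows sum to 1; the function name `attention_entropy` and the two synthetic heads are illustrative and do not come from a trained model.

```python
import numpy as np

def attention_entropy(A, eps=1e-12):
    """Mean row entropy (nats) per head for A of shape (H, T, T)."""
    row_entropy = -np.sum(A * np.log(A + eps), axis=-1)   # (H, T)
    return row_entropy.mean(axis=-1)                      # (H,)

T = 8
uniform_head = np.full((T, T), 1.0 / T)   # maximally diffuse: entropy log(T)
one_hot_head = np.eye(T)                  # maximally sparse: entropy 0
per_head = attention_entropy(np.stack([uniform_head, one_hot_head]))

print("per-head entropy:", per_head)
print("spread across heads (std):", per_head.std())
```

Tracking the mean and spread of `per_head` across layers and training steps is one simple way to see whether heads are collapsing toward sparse or diffuse patterns.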

4) Compute tradeoffs

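For the score and value-mixing matrix multiplications, the computational cost of attention is:

O(T^2 d_{\text{model}})

The Q/K/V and output projections add a further $O(T d_{\text{model}}^2)$. Per-head attention uses matrices of size $T \times d_k$, so increasing head count increases the number of matrix multiplications while keeping the total multiply-add count roughly constant, since $H d_k = d_{\text{model}}$. The number of $T \times T$ attention matrices, and with it the softmax work and memory, grows linearly with $H$. In practice, kernel fusion and GPU efficiency often favor a moderate number of heads rather than extreme head counts.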
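A back-of-the-envelope count makes the tradeoff concrete. The sketch counts multiply-adds for the score and value-mixing products only, ignoring projections, softmax cost, and hardware effects; the function name `attention_cost` and the example sizes are hypothetical.

```python
def attention_cost(T, d_model, H):
    """Rough multiply-add and attention-entry counts for H heads."""
    if d_model % H != 0:
        raise ValueError("d_model must be divisible by H")
    d_k = d_model // H
    score_macs = H * T * T * d_k   # Q_h K_h^T for every head
    mix_macs = H * T * T * d_k     # A_h V_h for every head
    return {
        "d_k": d_k,
        "matmul_macs": score_macs + mix_macs,   # equals 2 * T^2 * d_model
        "attention_entries": H * T * T,         # softmax inputs, grows with H
    }

costs = {H: attention_cost(T=1024, d_model=512, H=H) for H in (1, 8, 64)}
for H, c in costs.items():
    print(H, c)
```

The matmul count is identical for every `H`, while the number of attention entries grows linearly with it, which is part of why very high head counts tend to be less efficient in practice.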

5) Interpreting attention maps

If you visualize attention in a model with many heads, you’ll see many seemingly “noisy” heads. This is not necessarily a failure: with smaller $d_k$, chance alignments between queries and keys are more likely, so individual maps can look noisier. When evaluating head behavior, consider both head count and head dimension.

Visual Aids

The visualizations below are toy simulations and explanatory diagrams, not measurements from a trained production model.

[Figure: Vector geometry (query vs. keys). Query and keys in 2D dot-product geometry.]

[Figure: Scaled dot-product distribution by head dimension.]

[Figure: Cosine similarity distribution vs. head dimension.]

[Figure: Average attention entropy vs. number of heads (toy).]

Conclusion

Attention is a learned kernel whose statistics govern model behavior. Queries, keys, and values define the geometry, while scaling keeps the simplest dot-product variance under control. Head count changes the number of learned kernels and the capacity available to each one, which then affects specialization, attention entropy, gradient dynamics, and runtime efficiency.

For practitioners, the takeaway is clear: head count and head dimension are not cosmetic hyperparameters. They shape the attention landscape, impacting both interpretability and performance. Understanding these effects allows more informed scaling decisions and better diagnosis of model behavior in practice.

References

Vaswani et al. *Attention Is All You Need*. NeurIPS, 2017.
Choromanski et al. *Rethinking Attention with Performers*. ICLR, 2021.
Jain and Wallace. *Attention is not Explanation*. NAACL, 2019.
Xiong et al. *On Layer Normalization in the Transformer Architecture*. ICML, 2020.
Press et al. *Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation*. ICLR, 2022.
