Understanding Transformer Attention: A Visual Guide
A practical walkthrough of how self-attention works in transformers, why it matters, and how to build intuition for what attention heads actually learn, with visual examples.
When I built the LLM Simulator page on this site, I wanted to create an interactive way to understand transformer attention. But before the interactive version, let me walk through the core concepts in a way that builds real intuition, not just equation familiarity.
What Problem Does Attention Solve?
Before transformers, sequence models (RNNs, LSTMs) processed tokens one at a time, left to right. This created a bottleneck: information from early tokens had to survive through every subsequent step to influence later predictions. Long-range dependencies got lost in the noise.
Attention solves this by letting every token directly "look at" every other token in the sequence. Instead of passing information through a chain, each token computes a weighted sum over all other tokens, where the weights reflect relevance. This is why transformers can handle long documents — the 500th token can directly attend to the 3rd token.
Query, Key, Value: The Lookup Metaphor
The QKV mechanism is often explained with matrix equations, but the intuition is simpler. Think of it as a soft dictionary lookup:
- Query (Q): "What am I looking for?" — each token generates a query vector representing what information it needs
- Key (K): "What do I contain?" — each token generates a key vector advertising what information it offers
- Value (V): "Here's my actual content" — the information that gets retrieved when a query matches a key
The attention score between two tokens is the dot product of one token's query vector with the other's key vector, scaled by the square root of the key dimension. A high dot product means high relevance, which means more of that token's value vector gets mixed into the output.
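Here's a minimal NumPy sketch of that lookup. The function name and the tiny 3-token example are my own illustration, not from a real library; the logic is the standard scaled dot-product attention the text describes:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Soft dictionary lookup: each query scores every key,
    and the scores become mixing weights over the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_queries, n_keys) relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights                     # weighted sum of values

# Toy example: 3 tokens, each with 4-dimensional Q, K, and V vectors
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

Each row of `w` is one token's attention distribution: how much of every other token's value it pulls in.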
Why Scale by Square Root of d?
This is a detail people often memorize without understanding. Without scaling, the variance of the dot products grows linearly with the dimension d_k, so their typical magnitude grows like sqrt(d_k). Large dot products push softmax into its saturated region where gradients are tiny (the function is nearly flat at extreme values). Dividing by sqrt(d_k) keeps the variance of the dot products at approximately 1, regardless of dimension, ensuring healthy gradient flow during training.
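You can verify the variance claim empirically. A quick sketch (the dimensions and sample count are arbitrary choices for the demo): draw random unit-variance query and key vectors and compare the variance of their dot products before and after scaling:

```python
import numpy as np

rng = np.random.default_rng(42)
for d in (16, 256, 4096):
    # 10,000 random query/key pairs with unit-variance components
    q = rng.standard_normal((10_000, d))
    k = rng.standard_normal((10_000, d))
    raw = (q * k).sum(axis=-1)         # unscaled dot products: variance ~ d
    scaled = raw / np.sqrt(d)          # with the 1/sqrt(d_k) factor: variance ~ 1
    print(d, raw.var(), scaled.var())
```

At d = 4096, the unscaled dot products have a standard deviation of roughly 64; feeding values that large into softmax makes it effectively a hard argmax, with near-zero gradient almost everywhere.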
Multi-Head Attention: Parallel Pattern Detection
A single attention head learns one type of relationship. Multi-head attention runs several attention heads in parallel, each with its own learned Q, K, V projections. This lets the model simultaneously attend to different types of patterns:
- Head 1 might learn syntactic relationships (subject-verb agreement)
- Head 2 might learn positional patterns (attending to the previous token)
- Head 3 might learn semantic similarity (attending to related concepts)
The outputs of all heads are concatenated and projected back to the model dimension. The model's representational power comes not from any single head, but from the combination.
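The split-attend-concatenate-project pipeline can be sketched in a few lines of NumPy. This is an illustrative toy (the function name, shapes, and random weights are my own, and real implementations add masking, biases, and batching), but the head bookkeeping matches what the text describes:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Run n_heads attention heads in parallel, each in a
    d_model/n_heads subspace, then concatenate and project back."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    # Project, then split the last axis into (head, d_head) subspaces
    Q = (X @ W_q).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head scores
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                        # softmax per head
    heads = w @ V                                        # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o                                  # project back to d_model

# Toy: 5 tokens, model dimension 8, 2 heads of dimension 4 each
rng = np.random.default_rng(1)
d_model, n, n_heads = 8, 5, 2
X = rng.standard_normal((n, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads=n_heads)
print(out.shape)  # (5, 8)
```

Note that each head sees only a d_model/n_heads slice of the projected representation, which is why adding heads doesn't multiply the parameter count.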
Attention in Practice: What Heads Actually Learn
If you visualize attention patterns in a trained model (which you can do on the LLM Simulator page), you'll see surprisingly interpretable patterns. Some heads consistently attend to the previous token. Some attend to the first token (which often serves as a "default" sink). Some learn to attend to tokens with specific syntactic roles. And some heads seem to do nothing useful at all — redundancy is a feature, not a bug, in robust systems.
The key insight: attention is not just a mechanism, it's a learned routing system. The model learns to dynamically route information based on context. This is fundamentally different from architectures with fixed, input-independent connectivity, and it's a big part of why transformers generalize so well.