How Attention Works
The Query-Key-Value Framework
The attention mechanism operates through three learned linear transformations applied to each input token:
- Query (Q): Represents the information the token is seeking
- Key (K): Represents the information the token exposes for matching
- Value (V): Represents the information the token contributes to the output
By analogy to information retrieval, the Query acts as a search query, Keys act as index entries, and Values are the retrieved content.
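To make the three projections concrete, here is a minimal NumPy sketch. The sizes (d_model = 8, d_k = 4) and the random weight matrices are illustrative stand-ins for parameters a real model would learn during training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 3, 8, 4   # toy sizes chosen for illustration

# One row per input token (e.g., token embeddings).
X = rng.normal(size=(n_tokens, d_model))

# The three learned projection matrices. Here they are random
# stand-ins; in a real model they are trained end-to-end.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # what each token is seeking
K = X @ W_K   # what each token exposes for matching
V = X @ W_V   # what each token contributes to the output

print(Q.shape, K.shape, V.shape)   # (3, 4) (3, 4) (3, 4)
```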
Formal Computation
Attention(Q, K, V) = softmax(Q Kᵀ / √dₖ) V

where dₖ is the dimension of the Key vectors; dividing by √dₖ keeps the dot products from growing so large that the softmax saturates.
The computation proceeds in three stages:
- Compute compatibility scores between each Query and all Keys (dot product)
- Normalize scores into a probability distribution (softmax, sum to 1)
- Compute the output as a weighted sum of Values using these weights
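These three stages translate directly into a few lines of NumPy. This is a sketch of the formula above, not a production implementation; the helper names (softmax, attention) are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # stage 1: compatibility scores
    weights = softmax(scores, axis=-1)  # stage 2: each row sums to 1
    return weights @ V, weights         # stage 3: weighted sum of Values
```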
Step-by-Step Demonstration
The attention computation can be traced step by step for a small input such as "The cat sat".
Design Properties
The Q/K/V decomposition provides three critical properties:
- Learnability: The projection matrices are trained end-to-end, allowing the model to discover task-relevant notions of "query" and "key"
- Full connectivity: Every token can attend to every other token, imposing no structural constraints on dependency distance
- Parallelism: All pairwise scores are computed simultaneously via matrix multiplication, enabling efficient hardware utilization
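The full-connectivity and parallelism properties are visible directly in the shapes: a single matrix multiplication yields every pairwise score at once. A sketch with arbitrary toy sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_k = 6, 4
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))

# One matrix multiplication produces all n*n pairwise scores:
# row i holds token i's compatibility with every token, no matter
# how far apart the two tokens sit in the sequence.
scores = Q @ K.T / np.sqrt(d_k)
print(scores.shape)   # (6, 6)
```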
Key Takeaways
- Q, K, V are three learned linear projections of each token's representation
- Attention scores quantify the compatibility between each Query-Key pair
- The output is a weighted combination of Values, where weights are determined by attention scores