How Attention Works
Query, Key, Value
Attention uses three learned transformations of each word:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
Think of it like a search engine: the Query is your search term, Keys are like page titles, and Values are the actual content.
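The search-engine analogy can be made concrete with a few lines of code: score a query vector against each key by dot product, and the best-matching value wins. The vectors and "pages" below are invented purely for illustration.

```python
import numpy as np

# Hypothetical 2-D vectors: one query and three "pages" (key/value pairs).
query = np.array([1.0, 0.0])             # "what am I looking for?"
keys = np.array([[0.9, 0.1],             # page title 0: close to the query
                 [0.0, 1.0],             # page title 1: unrelated
                 [0.5, 0.5]])            # page title 2: partial match
values = ["cats", "finance", "pets"]     # the actual page content

scores = keys @ query                    # dot product: query-key similarity
best = int(np.argmax(scores))
print(values[best])                      # the best-matching key's value wins
```

Real attention doesn't pick a single winner like this — it blends all the values by their scores, as the formula below shows.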
The calculation
Attention(Q, K, V) = softmax(Q × Kᵀ / √d) × V

Don't worry about the math details — Kᵀ is just the transpose of K, and d is the dimension of the key vectors (dividing by √d keeps the dot products in a range where softmax behaves well). The key insight is:
- Compare each Query with all Keys (dot product)
- Convert scores to weights (softmax → sum to 1)
- Use weights to combine Values
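Put together, these three steps fit in a few lines of NumPy. This is a sketch with random stand-in matrices, not learned weights; d is the key dimension from the formula above.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # 1. compare each query with all keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # 2. softmax: rows sum to 1
    return weights @ V, weights            # 3. weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))  # 3 words, d = 4
out, w = attention(Q, K, V)
print(out.shape)                # (3, 4): one output vector per word
print(w.sum(axis=-1))           # each row of weights sums to 1
```

Note that the output has the same shape as the input: every word gets a new vector that mixes in information from all the other words.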
Step through the process
(Interactive demo: steps through how attention is calculated for "The cat sat".)
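The same walkthrough can be traced in code for the three-token sentence. The Q/K/V vectors here are made up for illustration — in a real model they come from learned projections.

```python
import numpy as np

tokens = ["The", "cat", "sat"]
# Made-up 2-D query/key/value vectors per token (a real model learns these).
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])
V = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# Step 1: compare each query with every key (scaled dot products).
scores = Q @ K.T / np.sqrt(Q.shape[-1])
# Step 2: softmax turns each row of scores into weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
# Step 3: each token's output is its weighted mix of all value vectors.
output = weights @ V

for t, w in zip(tokens, weights):
    print(f"{t!r} attends with weights {np.round(w, 2)}")
```

Printing the intermediate `scores` and `weights` is a good way to build intuition: you can see exactly which tokens each word pays attention to.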
Why this design?
The Q/K/V design is powerful because:
- Learnable: The model learns what to query and what to expose as keys
- Flexible: Any word can attend to any other word
- Parallel: All attention calculations happen simultaneously
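All three properties trace back to how Q, K, and V are produced: learned weight matrices applied to every token embedding at once. A minimal sketch, with random matrices standing in for the learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 8))          # 3 token embeddings, dimension 8

# W_Q, W_K, W_V are learned during training; random placeholders here.
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))

# One matrix multiply each gives Q, K, V for ALL tokens simultaneously --
# this is what makes attention both learnable and parallel.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)     # (3, 4) each
```

Because the projections are plain matrix multiplies over the whole sequence, there is no step-by-step loop over words — the hardware can process every token at the same time.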
Key Takeaways
- Q, K, V are three different projections of each word
- Attention = how much each Query matches each Key
- Output = weighted sum of Values based on attention