What is Attention?
Mechanism Definition
Attention is a learned mechanism that enables the model to assign differential relevance to each token in the input when computing the representation of a given token.
For each token, attention produces a set of weights over all tokens in the sequence (including the token itself). These weights have the following properties:
- Range: [0, 1] per token pair
- Normalization: weights sum to 1 (forming a probability distribution)
- Interpretation: higher weight indicates greater relevance
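The properties above can be sketched in code. The following is a minimal, illustrative implementation of scaled dot-product attention weights; the vectors are random toy values rather than learned parameters, and the function name `attention_weights` is chosen here for illustration.

```python
import numpy as np

def attention_weights(query, keys):
    """Score one query against all keys, then softmax-normalize.

    Returns weights in [0, 1] that sum to 1 -- a probability
    distribution over the tokens, as described above.
    """
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)    # one relevance score per token
    exp = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exp / exp.sum()                # normalize to a distribution

# Toy example: one query token attending over four tokens.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((4, 8))
w = attention_weights(q, K)
print(w, w.sum())
```

The softmax is what guarantees both properties at once: exponentiation keeps every weight positive, and dividing by the sum forces the weights to total 1.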
Interactive Visualization
[Interactive attention-weights visualization: selecting a word shows how much attention it pays to the other words; thicker lines indicate stronger attention.]
Contextual Disambiguation
When processing "bank" in "I went to the bank to deposit money", the attention mechanism assigns high weights to "deposit" and "money". These tokens provide the contextual signal that resolves "bank" to its financial interpretation.
In "I sat on the river bank", attention shifts to "river" and "sat", producing a distinct representation corresponding to the geographical meaning.
Without Attention
"bank" → context-invariant vector
With Attention
"bank" → context-dependent vector
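This contrast can be made concrete with a small sketch. Below, the same static vector stands in for "bank", and two sets of random vectors stand in for the financial and river contexts; these are toy values for illustration, not real embeddings. The point is that attention's output for "bank" differs depending on which context it attends over.

```python
import numpy as np

def attend(query, keys, values):
    # Softmax-normalized scores, then a weighted sum of value vectors.
    scores = keys @ query / np.sqrt(query.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values

rng = np.random.default_rng(1)
bank = rng.standard_normal(8)              # one context-invariant vector for "bank"
finance_ctx = rng.standard_normal((3, 8))  # stand-ins for "deposit", "money", ...
river_ctx = rng.standard_normal((3, 8))    # stand-ins for "river", "sat", ...

# With attention, the representation of "bank" depends on its context.
out_finance = attend(bank, finance_ctx, finance_ctx)
out_river = attend(bank, river_ctx, river_ctx)
print(np.allclose(out_finance, out_river))
```

Without attention, `bank` itself would be the representation in both sentences; with attention, the two outputs differ because each is a different weighted mixture of its surrounding tokens.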
Key Takeaways
- Attention computes pairwise relevance weights across all tokens
- Weights form a normalized probability distribution (sum to 1)
- The result is a context-dependent representation for each token