Context Changes Everything
Divergent Attention Patterns
The following demonstration places the token "bank" in two different sentences. The attention weight distributions differ substantially between the two cases, because the mechanism computes each token's weights from the tokens around it.
Compare Attention Patterns
See how "bank" pays attention to different words based on context.
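To make the idea concrete, here is a minimal numpy sketch of scaled dot-product self-attention over two toy sentences. The four-dimensional embeddings are hand-picked illustrative values, not learned ones, and queries, keys, and values are all taken to be the raw embeddings (real models apply learned projections first), so this only sketches the mechanism.

```python
import numpy as np

# Toy 4-dimensional embeddings; illustrative values, not learned.
emb = {
    "the":   np.array([0.1, 0.1, 0.0, 0.0]),
    "river": np.array([0.9, 0.2, 0.0, 0.1]),
    "bank":  np.array([0.5, 0.5, 0.5, 0.5]),
    "money": np.array([0.0, 0.1, 0.9, 0.8]),
}

def attention_weights(tokens):
    X = np.stack([emb[t] for t in tokens])            # (seq_len, d)
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                     # query-key similarities
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)  # softmax each row

for sent in (["the", "river", "bank"], ["the", "money", "bank"]):
    w = attention_weights(sent)
    print(sent, "->", dict(zip(sent, np.round(w[-1], 3))))  # "bank"'s row
```

Running it shows that the row of weights for "bank" shifts depending on whether "river" or "money" is nearby, which is exactly what the interactive comparison above visualizes.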
Contextual Embeddings
After attention, each token receives a new representation that incorporates information from its context. This output is termed a contextual embedding.
- Static embedding: "bank" maps to a context-invariant vector
- Contextual embedding: "bank" maps to a context-dependent vector
This property enables models such as Claude to resolve lexical ambiguity (a river bank versus a financial bank), detect pragmatic nuance, and interpret figurative language, distinctions that context-free representations cannot make.
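Continuing the toy sketch above (and reusing its `emb` table and `attention_weights` helper), the difference is easy to see: the static vector for "bank" is identical in both sentences, while the attention output mixes in the neighbors' vectors.

```python
import numpy as np

# Reuses emb and attention_weights from the previous snippet (toy values).
def contextual_embeddings(tokens):
    X = np.stack([emb[t] for t in tokens])
    W = attention_weights(tokens)
    return W @ X                                  # weighted mix of value vectors

river = contextual_embeddings(["the", "river", "bank"])[-1]
money = contextual_embeddings(["the", "money", "bank"])[-1]
print("static :", emb["bank"])                    # identical in both sentences
print("river  :", np.round(river, 3))             # pulled toward "river"
print("money  :", np.round(money, 3))             # pulled toward "money"
```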
Attention Weight Matrix
Attention Heatmap
Each row shows how much that word attends to every other word. Brighter = more attention.
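A static version of such a heatmap can be rendered from the toy weight matrix computed earlier. This sketch assumes matplotlib is installed and reuses the `attention_weights` helper from the first snippet.

```python
import matplotlib.pyplot as plt

# Render the toy attention matrix; brighter cells mean more attention.
tokens = ["the", "river", "bank"]
W = attention_weights(tokens)                     # (3, 3) weight matrix

plt.imshow(W, cmap="viridis")
plt.xticks(range(len(tokens)), tokens)            # columns: key tokens
plt.yticks(range(len(tokens)), tokens)            # rows: query tokens
plt.colorbar(label="attention weight")
plt.title("Each row sums to 1 (softmax)")
plt.show()
```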
From Attention to Transformers
Production language models employ multi-head attention (multiple independent attention functions operating in parallel) and stack dozens of attention layers. Each successive layer refines the representations produced by the one before it, building progressively more abstract features.
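The sketch below extends the single-head toy example to multi-head attention: the model dimension is split across heads, each head runs the same attention computation on its own slice, and the head outputs are concatenated and projected back. The matrices here are random stand-ins for learned parameters, and the sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2                           # toy sizes, not realistic
d_head = d_model // n_heads
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

def multi_head_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outputs = []
    for h in range(n_heads):                      # each head attends independently
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        W = np.exp(scores)
        W /= W.sum(axis=-1, keepdims=True)        # softmax per query
        outputs.append(W @ V[:, sl])
    return np.concatenate(outputs, axis=-1) @ Wo  # merge heads, project back

X = rng.normal(size=(5, d_model))                 # 5 tokens
print(multi_head_attention(X).shape)              # (5, 8): one vector per token
```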
The Transformer architecture (Vaswani et al., 2017) combined self-attention with residual connections, layer normalization, and position-wise feed-forward networks, establishing the foundation for GPT, BERT, Claude, and virtually all contemporary language models.
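Putting the pieces together, one encoder-style block might be sketched as follows, reusing `multi_head_attention`, `rng`, and `d_model` from the previous snippet. The residual connections (the `X + ...` terms) and the two-layer feed-forward network are the ingredients the Transformer paper combined with attention; the weights are again random stand-ins.

```python
import numpy as np

# Reuses multi_head_attention, rng, and d_model from the sketch above.
W1 = rng.normal(size=(d_model, 4 * d_model)) * 0.1
W2 = rng.normal(size=(4 * d_model, d_model)) * 0.1

def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=-1, keepdims=True)
    sd = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sd + eps)

def transformer_block(X):
    X = layer_norm(X + multi_head_attention(X))   # residual around attention
    ffn = np.maximum(X @ W1, 0) @ W2              # two-layer ReLU network
    return layer_norm(X + ffn)                    # residual around the FFN

X = rng.normal(size=(5, d_model))
for _ in range(3):                                # stacking layers refines X
    X = transformer_block(X)
print(X.shape)                                    # (5, 8)
```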
Key Takeaways
- Attention produces context-dependent representations for each token
- Identical tokens receive distinct vectors in different sentential contexts
- Multi-head, multi-layer attention is the computational foundation of modern language models