Context Changes Everything
Divergent Attention Patterns
The following demonstration places the token "bank" in two different sentences. The attention weight distributions differ substantially between the two cases, because the mechanism computes each token's weights from the tokens around it.
Compare Attention Patterns
See how "bank" pays attention to different words based on context.
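To make the idea concrete, here is a minimal numpy sketch of scaled dot-product self-attention over two toy sentences. The four-dimensional embeddings are hand-picked illustrative values, not learned ones, and queries, keys, and values are all taken to be the raw embeddings (real models apply learned projections first), so this only sketches the mechanism.

```python
import numpy as np

# Toy 4-dimensional embeddings; illustrative values, not learned.
emb = {
    "the":   np.array([0.1, 0.1, 0.0, 0.0]),
    "river": np.array([0.9, 0.2, 0.0, 0.1]),
    "bank":  np.array([0.5, 0.5, 0.5, 0.5]),
    "money": np.array([0.0, 0.1, 0.9, 0.8]),
}

def attention_weights(tokens):
    X = np.stack([emb[t] for t in tokens])            # (seq_len, d)
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                     # query-key similarities
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)  # softmax each row

for sent in (["the", "river", "bank"], ["the", "money", "bank"]):
    w = attention_weights(sent)
    print(sent, "->", dict(zip(sent, np.round(w[-1], 3))))  # "bank"'s row
```

Running it shows that the row of weights for "bank" shifts depending on whether "river" or "money" is nearby, which is exactly what the interactive comparison above visualizes.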
Contextual Embeddings
After attention, each token receives a new representation that incorporates information from its context. This output is termed a contextual embedding.
- Static embedding: "bank" maps to a context-invariant vector
- Contextual embedding: "bank" maps to a context-dependent vector
This property enables models such as Claude to resolve lexical ambiguity (a river bank versus a financial bank), detect pragmatic nuance, and interpret figurative language, distinctions that context-free representations cannot make.
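Continuing the toy sketch above (and reusing its `emb` table and `attention_weights` helper), the difference is easy to see: the static vector for "bank" is identical in both sentences, while the attention output mixes in the neighbors' vectors.

```python
import numpy as np

# Reuses emb and attention_weights from the previous snippet (toy values).
def contextual_embeddings(tokens):
    X = np.stack([emb[t] for t in tokens])
    W = attention_weights(tokens)
    return W @ X                                  # weighted mix of value vectors

river = contextual_embeddings(["the", "river", "bank"])[-1]
money = contextual_embeddings(["the", "money", "bank"])[-1]
print("static :", emb["bank"])                    # identical in both sentences
print("river  :", np.round(river, 3))             # pulled toward "river"
print("money  :", np.round(money, 3))             # pulled toward "money"
```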
Attention Weight Matrix
Attention Heatmap
Each row shows how much that word attends to every other word. Brighter = more attention.
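A static version of such a heatmap can be rendered from the toy weight matrix computed earlier. This sketch assumes matplotlib is installed and reuses the `attention_weights` helper from the first snippet.

```python
import matplotlib.pyplot as plt

# Render the toy attention matrix; brighter cells mean more attention.
tokens = ["the", "river", "bank"]
W = attention_weights(tokens)                     # (3, 3) weight matrix

plt.imshow(W, cmap="viridis")
plt.xticks(range(len(tokens)), tokens)            # columns: key tokens
plt.yticks(range(len(tokens)), tokens)            # rows: query tokens
plt.colorbar(label="attention weight")
plt.title("Each row sums to 1 (softmax)")
plt.show()
```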
From Attention to Transformers
Production language models employ multi-head attention (multiple independent attention functions operating in parallel) and stack dozens of attention layers. Each successive layer refines the representations produced by the one before it, building progressively more abstract features.
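The sketch below extends the single-head toy example to multi-head attention: the model dimension is split across heads, each head runs the same attention computation on its own slice, and the head outputs are concatenated and projected back. The matrices here are random stand-ins for learned parameters, and the sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2                           # toy sizes, not realistic
d_head = d_model // n_heads
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

def multi_head_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outputs = []
    for h in range(n_heads):                      # each head attends independently
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        W = np.exp(scores)
        W /= W.sum(axis=-1, keepdims=True)        # softmax per query
        outputs.append(W @ V[:, sl])
    return np.concatenate(outputs, axis=-1) @ Wo  # merge heads, project back

X = rng.normal(size=(5, d_model))                 # 5 tokens
print(multi_head_attention(X).shape)              # (5, 8): one vector per token
```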
The Transformer architecture (Vaswani et al., 2017) combined self-attention with residual connections, layer normalization, and position-wise feed-forward networks, establishing the foundation for GPT, BERT, Claude, and virtually all contemporary language models.
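Putting the pieces together, one encoder-style block might be sketched as follows, reusing `multi_head_attention`, `rng`, and `d_model` from the previous snippet. The residual connections (the `X + ...` terms) and the two-layer feed-forward network are the ingredients the Transformer paper combined with attention; the weights are again random stand-ins.

```python
import numpy as np

# Reuses multi_head_attention, rng, and d_model from the sketch above.
W1 = rng.normal(size=(d_model, 4 * d_model)) * 0.1
W2 = rng.normal(size=(4 * d_model, d_model)) * 0.1

def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=-1, keepdims=True)
    sd = X.std(axis=-1, keepdims=True)
    return (X - mu) / (sd + eps)

def transformer_block(X):
    X = layer_norm(X + multi_head_attention(X))   # residual around attention
    ffn = np.maximum(X @ W1, 0) @ W2              # two-layer ReLU network
    return layer_norm(X + ffn)                    # residual around the FFN

X = rng.normal(size=(5, d_model))
for _ in range(3):                                # stacking layers refines X
    X = transformer_block(X)
print(X.shape)                                    # (5, 8)
```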
Key Takeaways
- Attention produces context-dependent representations for each token
- Identical tokens receive distinct vectors in different sentential contexts
- Multi-head, multi-layer attention is the computational foundation of modern language models