Attention Lesson 2 of 4

What is Attention?

Mechanism Definition

Attention is a learned mechanism that enables the model to assign differential relevance to each token in the input when computing the representation of a given token.

For each token, attention produces a set of weights over all tokens in the input (including the token itself). These weights have the following properties:

  • Range: [0, 1] per token pair
  • Normalization: weights sum to 1 (forming a probability distribution)
  • Interpretation: higher weight indicates greater relevance
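The three properties above can be seen in a minimal sketch of scaled dot-product attention weights. This is an illustrative NumPy implementation (the function name and toy numbers are our own, not from the lesson): the softmax guarantees every weight lies in [0, 1] and that the weights sum to 1.

```python
import numpy as np

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query token.

    Returns one weight per token: each in [0, 1], summing to 1
    (a probability distribution over the input tokens).
    """
    d = keys.shape[-1]
    scores = keys @ query / np.sqrt(d)   # one relevance score per token
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy example: 4 tokens, each with a 3-dimensional key vector (made-up data).
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 3))
query = rng.normal(size=3)

w = attention_weights(query, keys)
print(w, w.sum())  # four weights in [0, 1]; the sum is 1.0
```

Higher-scoring tokens receive exponentially larger weight, which is how "higher weight indicates greater relevance" falls out of the softmax.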

Interactive Visualization

[Interactive widget: click any word to see how much attention it pays to the other words; thicker lines indicate stronger attention.]

Contextual Disambiguation

When processing "bank" in "I went to the bank to deposit money", the attention mechanism assigns high weights to "deposit" and "money". These tokens provide the contextual signal that resolves "bank" to its financial interpretation.

In "I sat on the river bank", attention shifts to "river" and "sat", producing a distinct representation corresponding to the geographical meaning.

Without Attention

"bank" → context-invariant vector

With Attention

"bank" → context-dependent vector
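The contrast above can be made concrete with a small sketch: the same fixed embedding for "bank" is combined with two different contexts, and attention (a softmax-weighted sum over the context's value vectors) yields two different output vectors. All vectors here are made-up random data for illustration, not from the lesson.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def contextual_vector(query, keys, values):
    """Attention output for one token: weighted sum of value vectors,
    with weights from scaled dot-product attention."""
    w = softmax(keys @ query / np.sqrt(keys.shape[-1]))
    return w @ values

rng = np.random.default_rng(1)
bank = rng.normal(size=4)  # one fixed, context-invariant embedding for "bank"

# Two different sentence contexts: each contributes its own (keys, values).
money_keys, money_values = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
river_keys, river_values = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))

v_money = contextual_vector(bank, money_keys, money_values)
v_river = contextual_vector(bank, river_keys, river_values)
print(np.allclose(v_money, v_river))  # False: same word, different vectors
```

Without attention, "bank" would map to the same vector in both sentences; with attention, the surrounding tokens reshape its representation.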

Key Takeaways

  • Attention computes pairwise relevance weights across all tokens
  • Weights form a normalized probability distribution (sum to 1)
  • The result is a context-dependent representation for each token