Attention Lesson 3 of 4

How Attention Works

The Query-Key-Value Framework

The attention mechanism operates through three learned linear transformations applied to each input token:

  • Query (Q): Represents the information the token is seeking
  • Key (K): Represents the information the token exposes for matching
  • Value (V): Represents the information the token contributes to the output

By analogy to information retrieval: the Query functions as a search query, Keys function as index entries, and Values function as the retrieved content.
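The three projections are plain matrix multiplications. A minimal sketch in NumPy, where the weight matrices are random stand-ins for parameters a real model would learn (the names W_q, W_k, W_v and the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                    # embedding size and projection size (made up)
x = rng.normal(size=(3, d_model))      # 3 tokens, each a d_model-dim embedding

# In a trained model these matrices are learned end-to-end; here they are random.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = x @ W_q   # what each token is seeking
K = x @ W_k   # what each token exposes for matching
V = x @ W_v   # what each token contributes to the output

print(Q.shape, K.shape, V.shape)   # (3, 4) (3, 4) (3, 4)
```

Note that all three projections read from the same input representation x; only the learned weights differ.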

Formal Computation

Attention(Q, K, V) = softmax(Q × Kᵀ / √dₖ) × V

where dₖ is the dimension of the Key vectors. Dividing by √dₖ keeps the dot products from growing with dimension and saturating the softmax, which would produce vanishingly small gradients.

The computation proceeds in three stages:

  1. Compute compatibility scores between each Query and all Keys (dot product)
  2. Normalize scores into a probability distribution (softmax, sum to 1)
  3. Compute the output as a weighted sum of Values using these weights
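The three stages above translate directly into a few lines of NumPy. This is a minimal single-head sketch that ignores batching and masking:

```python
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # 1. compatibility of each Query with all Keys
    # 2. softmax over each row (subtracting the max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                    # 3. weighted sum of Values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)   # (3, 4): one output vector per token
```

Each output row is a convex combination of the rows of V, so it stays in the span of the Value vectors.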

Step-by-Step Demonstration

As a running example, consider the three-token sentence "The cat sat": each token produces a Query, a Key, and a Value; each Query is scored against all three Keys, and the resulting weights blend the three Values into that token's output.
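The walkthrough can be reproduced with a toy calculation. The 2-dimensional Query/Key/Value vectors below are hand-picked for readability, not learned values:

```python
import numpy as np

tokens = ["The", "cat", "sat"]
# Hand-picked 2-d vectors, one row per token (illustrative only).
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

d_k = K.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)                   # step 1: Query-Key dot products
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)    # step 2: softmax per token
output = weights @ V                              # step 3: blend the Values

for t, w in zip(tokens, weights):
    print(f"{t!r} attends to", dict(zip(tokens, w.round(2))))
```

Each printed row shows one token's attention distribution over all three tokens; the rows sum to 1 by construction.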

Design Properties

The Q/K/V decomposition provides three critical properties:

  • Learnability: The projection matrices are trained end-to-end, allowing the model to discover task-relevant notions of "query" and "key"
  • Full connectivity: Every token can attend to every other token, imposing no structural constraints on dependency distance
  • Parallelism: All pairwise scores are computed simultaneously via matrix multiplication, enabling efficient hardware utilization
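The full-connectivity and parallelism properties are visible in the shapes alone: a single matrix product yields all n × n pairwise scores at once, with no dependence on how far apart two tokens sit. A small sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 6, 4                    # sequence length and head size (illustrative)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))

scores = Q @ K.T                 # one matmul -> every token-pair score
print(scores.shape)              # (6, 6): entry [i, j] scores token i against token j
```

This is also why attention cost scales quadratically with sequence length: the score matrix itself has n² entries.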

Key Takeaways

  • Q, K, V are three learned linear projections of each token's representation
  • Attention scores quantify the compatibility between each Query-Key pair
  • The output is a weighted combination of Values, where weights are determined by attention scores