How Attention Works
The Query-Key-Value Framework
The attention mechanism operates through three learned linear transformations applied to each input token:
- Query (Q): Represents the information the token is seeking
- Key (K): Represents the information the token exposes for matching
- Value (V): Represents the information the token contributes to the output
By analogy to information retrieval, the Query acts as a search query, Keys act as index entries, and Values are the retrieved content.
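To make the three projections concrete, here is a minimal NumPy sketch. The sizes (d_model = 8, d_k = 4) and the random weight matrices are illustrative stand-ins for parameters a real model would learn during training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 3, 8, 4   # toy sizes chosen for illustration

# One row per input token (e.g., token embeddings).
X = rng.normal(size=(n_tokens, d_model))

# The three learned projection matrices. Here they are random
# stand-ins; in a real model they are trained end-to-end.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # what each token is seeking
K = X @ W_K   # what each token exposes for matching
V = X @ W_V   # what each token contributes to the output

print(Q.shape, K.shape, V.shape)   # (3, 4) (3, 4) (3, 4)
```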
Formal Computation
Attention(Q, K, V) = softmax(Q Kᵀ / √dₖ) V

where dₖ is the dimension of the Key vectors; dividing by √dₖ keeps the dot products from growing so large that the softmax saturates.
The computation proceeds in three stages:
- Compute compatibility scores between each Query and all Keys (dot product)
- Normalize scores into a probability distribution (softmax, sum to 1)
- Compute the output as a weighted sum of Values using these weights
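These three stages translate directly into a few lines of NumPy. This is a sketch of the formula above, not a production implementation; the helper names (softmax, attention) are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # stage 1: compatibility scores
    weights = softmax(scores, axis=-1)  # stage 2: each row sums to 1
    return weights @ V, weights         # stage 3: weighted sum of Values
```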
Step-by-Step Demonstration
The attention computation can be traced step by step for a small input such as "The cat sat".
Design Properties
The Q/K/V decomposition provides three critical properties:
- Learnability: The projection matrices are trained end-to-end, allowing the model to discover task-relevant notions of "query" and "key"
- Full connectivity: Every token can attend to every other token, imposing no structural constraints on dependency distance
- Parallelism: All pairwise scores are computed simultaneously via matrix multiplication, enabling efficient hardware utilization
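The full-connectivity and parallelism properties are visible directly in the shapes: a single matrix multiplication yields every pairwise score at once. A sketch with arbitrary toy sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_k = 6, 4
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))

# One matrix multiplication produces all n*n pairwise scores:
# row i holds token i's compatibility with every token, no matter
# how far apart the two tokens sit in the sequence.
scores = Q @ K.T / np.sqrt(d_k)
print(scores.shape)   # (6, 6)
```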
Key Takeaways
- Q, K, V are three learned linear projections of each token's representation
- Attention scores quantify the compatibility between each Query-Key pair
- The output is a weighted combination of Values, where weights are determined by attention scores