What's "Contextual Attention"?
Self-Attention vs Contextual Attention
Important Clarification
“Contextual attention” is not a standard term in the field. You might be thinking of different types of attention mechanisms. Let me explain the key distinctions:
Self-Attention (Standard Term)
Definition: Each token attends to all tokens in the same sequence (including itself)
Key Characteristics:
- Input sequence: [“The”, “river”, “bank”, “flows”]
- Each word looks at ALL words in the same sentence
- “bank” attends to: “The”, “river”, “bank”, “flows”
- Used in: BERT, GPT, most modern transformers
Formula:
Attention(Q,K,V) = softmax(QK^T/√d_k)V
where Q, K, V all come from the same input sequence and d_k is the key/query dimension
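To make the formula concrete, here is a minimal NumPy sketch of single-head self-attention. The toy dimensions and randomly initialised projection matrices (W_q, W_k, W_v) are illustrative assumptions, not taken from any real model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: Q, K, V are all projections of the same X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len): every token scores every token
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V                   # one contextual vector per token

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8          # toy sizes, e.g. the 4 tokens of "The river bank flows"
X = rng.normal(size=(seq_len, d_model))  # stand-in for the input token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```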
Cross-Attention (What you might mean by “contextual”)
Definition: Tokens from one sequence attend to tokens from another sequence
Key Characteristics:
- Two sequences: Source and Target
- Example: Translation - the French (target) sentence attends to the English (source) sentence
- Query comes from target, Key/Value from source
- Used in: Encoder-decoder models, T5, original Transformer
Formula:
CrossAttention(Q,K,V) = softmax(QK^T/√d_k)V
where Q comes from sequence A and K, V come from sequence B
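A matching sketch for cross-attention, again with illustrative random projections; the only change from the self-attention version is that Q is computed from the target sequence while K and V come from the source sequence:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X_target, X_source, W_q, W_k, W_v):
    """Queries from the target sequence; keys/values from the source sequence."""
    Q = X_target @ W_q                        # e.g. French decoder states
    K = X_source @ W_k                        # e.g. English encoder states
    V = X_source @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (target_len, source_len)
    return softmax(scores, axis=-1) @ V       # one vector per target token

rng = np.random.default_rng(1)
d_model = 8
X_src = rng.normal(size=(5, d_model))         # 5 source tokens
X_tgt = rng.normal(size=(3, d_model))         # 3 target tokens
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(cross_attention(X_tgt, X_src, W_q, W_k, W_v).shape)  # (3, 8)
```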
Common Confusion Points
1. Self-Attention Creates Contextual Embeddings
- Self-attention mechanism → produces contextual embeddings
- The mechanism is “self-attention”
- The output is “contextual embeddings”
2. Types of Self-Attention
| Type | Description | Example |
|---|---|---|
| Bidirectional | Can attend to past and future tokens | BERT |
| Causal/Masked | Can only attend to past tokens | GPT |
| Local | Only attends to nearby tokens | Some efficient transformers |
| Global | Attends to all tokens | Standard transformers |
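The bidirectional vs. causal distinction in the table above comes down to a mask applied to the score matrix before the softmax. A minimal NumPy sketch, with random scores standing in for QK^T/√d_k (not any specific model's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
rng = np.random.default_rng(2)
scores = rng.normal(size=(seq_len, seq_len))  # stand-in for QK^T / sqrt(d_k)

# Bidirectional (BERT-style): no mask, every token attends to every token.
bidirectional_weights = softmax(scores)

# Causal (GPT-style): positions above the diagonal are set to -inf before the
# softmax, so token i can only attend to tokens 0..i.
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
causal_weights = softmax(np.where(causal_mask, -np.inf, scores))

print(causal_weights.round(2))  # upper triangle is exactly 0
```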
3. Attention Variants
| Variant | What it means |
|---|---|
| Self-Attention | The same sequence attends to itself |
| Cross-Attention | Queries from one sequence attend to keys/values from another |
| Multi-Head | Multiple attention computations run in parallel |
| Scaled Dot-Product | The standard attention formula, with scaling by √d_k |
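For the Multi-Head row in particular, here is a hedged sketch of running several scaled dot-product heads in parallel and concatenating the results; the head count, dimensions, and random projections are illustrative only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads, rng):
    """Run num_heads scaled dot-product attentions in parallel, then concatenate."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head has its own (randomly initialised, illustrative) projections.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
        heads.append(weights @ V)
    W_o = rng.normal(size=(d_model, d_model))      # output projection
    return np.concatenate(heads, axis=-1) @ W_o    # back to (seq_len, d_model)

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 16))
print(multi_head_self_attention(X, num_heads=4, rng=rng).shape)  # (4, 16)
```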
Visual Comparison
Self-Attention (BERT/GPT style):
Input: "The bank river flows"
↓ ↓ ↓ ↓
[Attend to all tokens in same sequence]
↓ ↓ ↓ ↓
Output: Contextual embeddings
Cross-Attention (Translation style):
English (source): “The bank”   ←   French (target): “La banque”
        [French decoder tokens attend to English encoder tokens]
In Modern LLMs
GPT Models:
- Use causal self-attention
- Each token attends to previous tokens only
- Creates contextual embeddings
BERT Models:
- Use bidirectional self-attention
- Each token attends to all tokens
- Creates contextual embeddings
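To show that the output really is contextual embeddings, here is a sketch using the Hugging Face transformers library (assumed installed, along with torch); the two example sentences are made up for illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Return the contextual embedding of `word`'s first occurrence in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

a = embedding_of("the river bank flows", "bank")
b = embedding_of("the bank approved the loan", "bank")
print(torch.cosine_similarity(a, b, dim=0))  # well below 1.0: same word, different vectors
```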
T5 Models:
- Encoder: Bidirectional self-attention
- Decoder: Causal self-attention + cross-attention to encoder
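A minimal sketch of T5's encoder-decoder setup using the Hugging Face transformers library (assumed installed, along with torch and sentencepiece). Inside generate, every decoder layer applies causal self-attention over the tokens produced so far plus cross-attention to the encoder's output:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Encoder: bidirectional self-attention over the (task-prefixed) source sentence.
inputs = tokenizer("translate English to French: The bank is closed.", return_tensors="pt")

# Decoder: causal self-attention over the French tokens generated so far,
# plus cross-attention to the encoder states at every layer.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```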
Bottom Line
- Self-attention = the mechanism
- Contextual embeddings = the output
- Cross-attention = attention between different sequences
- There’s no standard term “contextual attention” - you likely mean one of the above!