What's Contextual Attention?

Self-Attention vs Contextual Attention

Important Clarification

“Contextual attention” is not a standard term in the field. You might be thinking of different types of attention mechanisms. Let me explain the key distinctions:

Self-Attention (Standard Term)

Definition: Each token attends to all tokens in the same sequence (including itself)

Key Characteristics:

  • Input sequence: [“The”, “bank”, “river”, “flows”]
  • Each word looks at ALL words in the same sentence
  • “bank” attends to: “The”, “bank”, “river”, “flows”
  • Used in: BERT, GPT, most modern transformers

Formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
where Q, K, and V are all projections of the same input sequence, and d_k is the key dimension
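
As a concrete (non-authoritative) illustration, here is a minimal NumPy sketch of scaled dot-product self-attention, where Q, K, and V are all projections of the same input. The projection matrices W_q, W_k, W_v, the dimensions, and the random inputs are made-up placeholders, not weights from any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted sum of value vectors

# Toy example: 4 tokens ("The", "bank", "river", "flows"), embedding dim 8 (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # token embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

# Self-attention: Q, K, and V all come from the same sequence X
Q, K, V = X @ W_q, X @ W_k, X @ W_v
contextual = scaled_dot_product_attention(Q, K, V)
print(contextual.shape)                       # (4, 8): one contextual vector per token
```

Each output row is a weighted mixture of the value vectors of the whole sentence, which is exactly why the result is described as a contextual embedding.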

Cross-Attention (What you might mean by “contextual”)

Definition: Tokens from one sequence attend to tokens from another sequence

Key Characteristics:

  • Two sequences: Source and Target
  • Example: translation - the target (French) sentence attends to the source (English) sentence
  • Query comes from target, Key/Value from source
  • Used in: Encoder-decoder models, T5, original Transformer

Formula:

CrossAttention(Q, K, V) = softmax(QK^T / √d_k) V
where Q comes from the target sequence (A) and K, V come from the source sequence (B)
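
Here is a matching NumPy sketch: the formula is identical, only the origin of Q versus K and V changes. The `source`/`target` arrays, sequence lengths, and projection matrices below are illustrative placeholders.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V -- same formula as self-attention."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d_model = 8
source = rng.normal(size=(5, d_model))   # e.g. encoder states for the English sentence
target = rng.normal(size=(3, d_model))   # e.g. decoder states for the French sentence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q = target @ W_q                         # queries come from sequence A (the target)
K = source @ W_k                         # keys come from sequence B (the source)
V = source @ W_v                         # values come from sequence B (the source)

out = attention(Q, K, V)
print(out.shape)                         # (3, 8): one vector per *target* token
```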

Common Confusion Points

1. Self-Attention Creates Contextual Embeddings

  • Self-attention mechanism → produces contextual embeddings
  • The mechanism is “self-attention”
  • The output is “contextual embeddings”

2. Types of Self-Attention

| Type          | Description                          | Example                     |
|---------------|--------------------------------------|-----------------------------|
| Bidirectional | Can attend to past and future tokens | BERT                        |
| Causal/Masked | Can only attend to past tokens       | GPT                         |
| Local         | Only attends to nearby tokens        | Some efficient transformers |
| Global        | Attends to all tokens                | Standard transformers       |
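
The bidirectional vs. causal/masked distinction in the table above comes down to a mask applied to the attention scores before the softmax. Below is a small NumPy sketch; passing the raw embeddings as Q, K, and V (with no learned projections) is a simplification for illustration only.

```python
import numpy as np

def masked_self_attention(Q, K, V, causal=False):
    """Self-attention with an optional causal (GPT-style) mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq_len, seq_len)
    if causal:
        seq_len = scores.shape[0]
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)         # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

X = np.random.default_rng(0).normal(size=(4, 8))           # 4 toy token embeddings
bert_style = masked_self_attention(X, X, X, causal=False)  # bidirectional: sees all tokens
gpt_style  = masked_self_attention(X, X, X, causal=True)   # causal: token i sees tokens 0..i
```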

3. Attention Variants

| Variant            | What it means                                                   |
|--------------------|-----------------------------------------------------------------|
| Self-Attention     | A sequence attends to itself                                    |
| Cross-Attention    | One sequence attends to another (queries vs. keys/values)       |
| Multi-Head         | Several attention heads computed in parallel, then concatenated |
| Scaled Dot-Product | The standard attention formula, with scores scaled by 1/√d_k    |
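
For the Multi-Head row above, here is a minimal sketch of running the same scaled dot-product attention over several heads in parallel by splitting the model dimension. Real multi-head attention also uses learned per-head Q/K/V projections and a final output projection, which are omitted here for brevity; the function name and dimensions are made up for illustration.

```python
import numpy as np

def multi_head_self_attention(X, num_heads):
    """Split the model dimension into heads, attend per head, then concatenate."""
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads

    # (num_heads, seq_len, d_head): each head sees its own slice of the features
    heads = X.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    outputs = []
    for H in heads:                            # here Q = K = V = H (no projections, for brevity)
        scores = H @ H.T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ H)

    # Concatenate head outputs back to (seq_len, d_model)
    return np.concatenate(outputs, axis=-1)

X = np.random.default_rng(1).normal(size=(4, 8))
print(multi_head_self_attention(X, num_heads=2).shape)   # (4, 8)
```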

Visual Comparison

Self-Attention (BERT/GPT style):

Input: "The bank river flows"
       ↓    ↓    ↓     ↓
    [Attend to all tokens in same sequence]
       ↓    ↓    ↓     ↓
Output: Contextual embeddings

Cross-Attention (Translation style):

English: "The bank"  →  French: "La banque"
         ↓                      ↑
      [Attend across languages]

In Modern LLMs

GPT Models:

  • Use causal self-attention
  • Each token attends to previous tokens only
  • Creates contextual embeddings

BERT Models:

  • Use bidirectional self-attention
  • Each token attends to all tokens
  • Creates contextual embeddings

T5 Models:

  • Encoder: Bidirectional self-attention
  • Decoder: Causal self-attention + cross-attention to encoder
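
To tie the T5-style decoder together, here is a schematic NumPy sketch of the data flow: causal self-attention over the decoder tokens, followed by cross-attention from those tokens to the encoder output. Projections, residual connections, layer norm, and the feed-forward block are deliberately omitted, so this illustrates the attention wiring rather than T5's actual implementation; all shapes below are arbitrary placeholders.

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention with an optional causal mask."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)          # block future decoder positions
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
encoder_out = rng.normal(size=(6, 8))    # bidirectional self-attention output (source side)
decoder_in  = rng.normal(size=(4, 8))    # decoder states so far (target side)

# 1. Causal self-attention: decoder tokens attend only to earlier decoder tokens
x = attention(decoder_in, decoder_in, decoder_in, causal=True)

# 2. Cross-attention: decoder queries attend to the encoder output
x = attention(x, encoder_out, encoder_out)

print(x.shape)                           # (4, 8): one vector per target token
```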

Bottom Line

  • Self-attention = the mechanism
  • Contextual embeddings = the output
  • Cross-attention = attention between different sequences
  • There’s no standard term “contextual attention”; you likely mean one of the above!