Contextual Embeddings - A Simple Summary
What Are They?
Word representations that change based on context, unlike static embeddings where each word has a fixed vector.
Example: “bank” gets different embeddings in:
- “river bank” (geographic sense)
- “savings bank” (financial sense)
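One way to see this directly is to compare the vectors a pretrained model produces for “bank” in the two contexts. The sketch below assumes the Hugging Face transformers and torch packages are installed; the model name and sentences are illustrative choices, not a specific recommendation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_embedding(sentence: str) -> torch.Tensor:
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

river = bank_embedding("He sat on the river bank.")
money = bank_embedding("She opened an account at the savings bank.")

# Same word, noticeably different vectors once context is mixed in:
# cosine similarity is typically well below 1.0.
print(torch.cosine_similarity(river, money, dim=0))
```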
Where Are They Created?
Inside the self-attention mechanism of transformer layers.
How Do They Work?
The Process:
- Start with static embeddings (from a lookup table) plus positional encodings
- Self-attention computes how much each token should “pay attention” to every other token
- Embeddings are mixed according to those attention weights → contextual embeddings
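A minimal PyTorch sketch of the first step, before any attention is applied. The vocabulary size, sequence length, model dimension, and token IDs below are illustrative assumptions, and learned positional embeddings are just one common choice.

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50_000, 512, 768   # illustrative sizes

token_embed = nn.Embedding(vocab_size, d_model)   # static lookup table
pos_embed = nn.Embedding(max_len, d_model)        # learned positional encoding

token_ids = torch.tensor([[101, 2314, 2924, 102]])          # (batch=1, seq_len=4), made-up IDs
positions = torch.arange(token_ids.size(1)).unsqueeze(0)    # (1, 4)

# Input to the first transformer layer: still context-free at this point.
x = token_embed(token_ids) + pos_embed(positions)           # (1, 4, 768)
```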
The Formula:
For each token i:
contextual_embedding[i] = Σ attention_weight[i, j] × value_embedding[j], summed over j = 0 … sequence_length − 1
Key insight: Each token’s final embedding is a weighted sum of ALL tokens in the sequence (including itself).
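The formula can be written out as single-head scaled dot-product attention. In the sketch below the projection matrices are random placeholders and the dimensions are illustrative; a real model uses trained weights and multiple heads.

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 4, 768
x = torch.randn(1, seq_len, d_model)               # output of the embedding step above

W_q = torch.randn(d_model, d_model)                # placeholder projection weights
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# attention_weight[i, j]: how much token i attends to token j
scores = Q @ K.transpose(-2, -1) / d_model ** 0.5  # (1, seq_len, seq_len)
attention_weight = F.softmax(scores, dim=-1)

# contextual_embedding[i] = sum_j attention_weight[i, j] * value_embedding[j]
contextual_embedding = attention_weight @ V        # (1, seq_len, d_model)
```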
Current Usage in LLMs
Modern Models:
- GPT-4, Claude, etc.: Stack many dozens of transformer layers (exact counts are not public)
- Each layer creates more sophisticated contextual embeddings
- Context windows: Up to millions of tokens in the largest models
- Multi-head attention: Captures different relationship types
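As a rough illustration of the multi-head idea, each head attends to the sequence independently, so every head can learn a different relationship type. The sketch uses PyTorch’s nn.MultiheadAttention with illustrative dimensions and assumes a reasonably recent PyTorch version (for the average_attn_weights argument).

```python
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 768, 12, 4           # illustrative sizes
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)
out, weights = mha(x, x, x, average_attn_weights=False)

# One attention map per head: (batch, num_heads, seq_len, seq_len)
print(weights.shape)
```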
Architecture:
- GPT: Causal (masked) self-attention
- BERT: Bidirectional self-attention
- T5: Encoder-decoder; bidirectional self-attention in the encoder, causal self-attention plus cross-attention over the encoder outputs in the decoder
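In practice, the difference between the GPT and BERT styles comes down to the attention mask added to the scores before the softmax. A small sketch with an illustrative sequence length:

```python
import torch

seq_len = 4

# Causal (GPT-style): positions above the diagonal are blocked with -inf,
# so token i can only attend to positions <= i.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# Bidirectional (BERT-style): no positions are blocked.
bidirectional_mask = torch.zeros(seq_len, seq_len)

print(causal_mask)
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```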
Why Important?
- Polysemy: Same word, different meanings
- Context understanding: “bank” near “river” vs “loan”
- Long-range dependencies: Words can influence each other across long distances
- Foundation of modern NLP: All current LLMs are built on this principle
Key Takeaway
Contextual embeddings aren’t just used in modern LLMs - they ARE what makes modern LLMs work. Every token’s representation dynamically incorporates information from the entire context through self-attention.