Contextual Embeddings

Contextual Embeddings - Simple Summary

What Are They?

Word representations that change based on the surrounding context, unlike static embeddings (such as word2vec or GloVe), where each word always has the same fixed vector.

Example: “bank” gets a different embedding in each of these contexts (sketched in code below):

  • “river bank” (geographic)
  • “savings bank” (financial)

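To see this concretely, you can pull per-token hidden states out of any pretrained encoder. The sketch below is illustrative rather than canonical: it assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, and bank_vector is a made-up helper name.

```python
# Sketch: the same surface word "bank" gets different contextual embeddings.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

river = bank_vector("I sat on the river bank.")
money = bank_vector("I deposited cash at the savings bank.")

# Same word, clearly different vectors: cosine similarity is well below 1.0.
print(torch.cosine_similarity(river, money, dim=0).item())
```
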
Where Are They Created?

Inside the self-attention mechanism of transformer layers.

How Do They Work?

The Process:

  1. Static embeddings (from a lookup table) + positional encoding (sketched in code below)
  2. Self-attention calculates how much each word should “pay attention” to others
  3. Mix embeddings based on attention weights → Contextual embeddings

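A minimal sketch of step 1 in NumPy, using the sinusoidal positional encoding from the original Transformer paper; the vocabulary size, model dimension, and token ids are arbitrary illustrative values, and in a real model the embedding table is learned rather than random.

```python
# Step 1 sketch: static embedding lookup + sinusoidal positional encoding.
import numpy as np

vocab_size, d_model, seq_len = 10_000, 64, 6                   # illustrative sizes
embedding_table = np.random.randn(vocab_size, d_model) * 0.02  # learned in practice

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (sin on even dims, cos on odd dims)."""
    pos = np.arange(seq_len)[:, None]                          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                       # (1, d_model / 2)
    angles = pos / np.power(10_000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

token_ids = np.array([5, 42, 7, 42, 981, 3])                   # arbitrary example ids
x = embedding_table[token_ids] + positional_encoding(seq_len, d_model)
# x is the position-aware static input that flows into the first attention layer.
```
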
The Formula:

For each token i:
contextual_embedding[i] = Σ_j attention_weight[i, j] × value_embedding[j]   (summing over j = 1 … sequence_length)

Key insight: Each token’s final embedding is a weighted sum of the value embeddings of ALL tokens in the sequence (including its own).

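The formula written out as runnable code: a minimal single-head self-attention sketch in NumPy, with random matrices standing in for the learned query/key/value projections.

```python
# Steps 2-3 sketch: single-head self-attention turns static embeddings
# into contextual embeddings via an attention-weighted sum of value vectors.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model) static embeddings plus positional encoding."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = K.shape[-1]
    attention_weights = softmax(Q @ K.T / np.sqrt(d_k))    # (seq_len, seq_len), each row sums to 1
    return attention_weights @ V                           # row i = Σ_j weight[i, j] × value[j]

seq_len, d_model = 6, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
contextual = self_attention(x, Wq, Wk, Wv)                 # one contextual embedding per token
```
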
Current Usage in LLMs

Modern Models:

  • GPT-4, Claude, etc.: stack on the order of 100 transformer layers (GPT-3 used 96; exact counts for newer frontier models are not public)
  • Each layer refines the contextual embeddings produced by the layer below
  • Context windows: up to millions of tokens in the largest models
  • Multi-head attention: captures different relationship types in parallel (sketched below)

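A rough sketch of the multi-head idea: the single-head computation above is run several times in parallel on smaller projections, giving each head room to learn a different kind of relationship, and the head outputs are concatenated. Sizes and weights are illustrative stand-ins.

```python
# Sketch: multi-head attention = several small attention heads run in parallel.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, n_heads=8):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(n_heads):                               # random stand-ins for learned weights
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)                          # (seq_len, d_head) per head
    return np.concatenate(heads, axis=-1)                  # (seq_len, d_model)

x = np.random.default_rng(1).normal(size=(6, 64))
print(multi_head_attention(x).shape)                       # (6, 64)
```
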
Architecture:

  • GPT: Causal (masked) self-attention; each token attends only to earlier positions (see the mask sketch below)
  • BERT: Bidirectional self-attention; each token attends to the whole sequence
  • T5: Encoder-decoder; bidirectional self-attention in the encoder, causal self-attention plus cross-attention in the decoder

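The GPT/BERT difference comes down to a mask added to the attention scores before the softmax. A small sketch, with zero scores standing in for Q·Kᵀ/√d_k:

```python
# Sketch: causal (GPT-style) vs bidirectional (BERT-style) attention masks.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 5
scores = np.zeros((seq_len, seq_len))        # stand-in for Q @ K.T / sqrt(d_k)

# Bidirectional (BERT): every token may attend to every other token.
bidirectional_mask = np.zeros((seq_len, seq_len))

# Causal (GPT): token i may only attend to positions j <= i; future positions
# are pushed to -inf so they receive zero weight after the softmax.
causal_mask = np.where(np.tril(np.ones((seq_len, seq_len))) == 1, 0.0, -np.inf)

print(softmax(scores + causal_mask))         # lower-triangular attention pattern
print(softmax(scores + bidirectional_mask))  # uniform attention over all positions
```
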
Why Important?

  • Polysemy: Same word, different meanings
  • Context understanding: “bank” near “river” vs “loan”
  • Long-range dependencies: Words can influence each other across long distances
  • Foundation of modern NLP: All current LLMs are built on this principle

Key Takeaway

Contextual embeddings aren’t just used in modern LLMs - they ARE what makes modern LLMs work. Every token’s representation dynamically incorporates information from the entire context through self-attention.