Transformer Hidden Size - Quick Notes
Core Concepts
Hidden State: The vector representation of each token at each layer
- Each token position has its own hidden state vector
- Content evolves through layers, but size stays constant
Hidden Size: The dimensionality of these vectors (e.g., 512, 768, 1024)
- Key architectural parameter
- Determines model width
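A minimal PyTorch sketch of the idea, assuming hypothetical values (batch=2, seq_len=10, vocab_size=30000, hidden_size=768 -- none of these come from the notes): the embedding layer gives every token position its own hidden_size-dimensional vector.

    import torch
    import torch.nn as nn

    batch, seq_len = 2, 10                 # hypothetical values for illustration
    vocab_size, hidden_size = 30000, 768   # hidden_size = model width

    embed = nn.Embedding(vocab_size, hidden_size)
    token_ids = torch.randint(0, vocab_size, (batch, seq_len))

    hidden_states = embed(token_ids)
    print(hidden_states.shape)             # torch.Size([2, 10, 768]): one 768-dim vector per token position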
Size Relationships
Single-Head Attention
- Q, K, V dimensions = hidden_size
- Linear projections: hidden_size → hidden_size
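A rough single-head sketch, assuming PyTorch and illustrative sizes (hidden_size=256, batch=2, seq_len=10): all three projections map hidden_size to hidden_size, so the attention output keeps the same width as its input.

    import torch
    import torch.nn as nn

    hidden_size = 256
    x = torch.randn(2, 10, hidden_size)    # [batch, seq_len, hidden_size]

    # Q, K, V projections: hidden_size -> hidden_size in single-head attention
    w_q = nn.Linear(hidden_size, hidden_size)
    w_k = nn.Linear(hidden_size, hidden_size)
    w_v = nn.Linear(hidden_size, hidden_size)

    q, k, v = w_q(x), w_k(x), w_v(x)
    scores = torch.softmax(q @ k.transpose(-2, -1) / hidden_size ** 0.5, dim=-1)
    out = scores @ v
    print(q.shape, out.shape)              # both [2, 10, 256]: same width in and out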
Multi-Head Attention
- d_k = d_v = hidden_size / num_heads
- Each head: hidden_size → d_k
- After concat: back to hidden_size
Example: hidden_size=256, heads=4
- Per head: 256/4 = 64
- Q, K, V per head: 64 dimensions
- Concatenated: 4 × 64 = 256
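A sketch of the split-and-concatenate shape bookkeeping for this example, assuming PyTorch and illustrative batch/seq_len values; the per-head attention math itself is omitted.

    import torch
    import torch.nn as nn

    batch, seq_len = 2, 10                 # illustrative values
    hidden_size, num_heads = 256, 4
    d_k = hidden_size // num_heads         # 256 / 4 = 64 per head

    x = torch.randn(batch, seq_len, hidden_size)
    w_q = nn.Linear(hidden_size, hidden_size)   # one projection, then split into heads

    q = w_q(x)                                           # [2, 10, 256]
    q = q.view(batch, seq_len, num_heads, d_k)           # [2, 10, 4, 64]
    q = q.transpose(1, 2)                                # [2, 4, 10, 64]: 64 dims per head

    # ... per-head attention happens here, then heads are concatenated back ...
    q = q.transpose(1, 2).contiguous().view(batch, seq_len, hidden_size)
    print(q.shape)                                       # [2, 10, 256]: 4 x 64 = 256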
Important Rules
- Hidden size is constant across all transformer layers
- FFN temporarily expands (usually 4×) then contracts back (see the sketch after this list)
- Hidden size determines Q/K/V dimensions, not vice versa
- Token positions are sequence indices (0, 1, 2, …)
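A minimal sketch of the FFN expand-then-contract rule, assuming PyTorch, a GELU activation, and the usual 4× factor (256 → 1024 → 256 here is just an illustration, not a value from the notes).

    import torch
    import torch.nn as nn

    hidden_size = 256
    ffn_size = 4 * hidden_size             # the usual 4x expansion

    ffn = nn.Sequential(
        nn.Linear(hidden_size, ffn_size),  # expand: 256 -> 1024
        nn.GELU(),
        nn.Linear(ffn_size, hidden_size),  # contract: 1024 -> 256
    )

    x = torch.randn(2, 10, hidden_size)
    print(ffn(x).shape)                    # [2, 10, 256]: back to hidden_size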
Flow Example
Input: [batch, seq_len, hidden_size]
Layer 1: [batch, seq_len, hidden_size]
Layer 2: [batch, seq_len, hidden_size]
...
Output: [batch, seq_len, hidden_size]
Size never changes, only content evolves!
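One way to check this invariance end to end, assuming PyTorch's built-in nn.TransformerEncoder and illustrative sizes (3 layers, 4 heads, hidden_size=256): the last dimension stays equal to d_model after every layer.

    import torch
    import torch.nn as nn

    batch, seq_len, hidden_size = 2, 10, 256
    num_layers, num_heads = 3, 4

    # d_model is the hidden size; every layer outputs the same width it receives
    layer = nn.TransformerEncoderLayer(
        d_model=hidden_size, nhead=num_heads, batch_first=True
    )
    encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    x = torch.randn(batch, seq_len, hidden_size)
    print(encoder(x).shape)                # torch.Size([2, 10, 256]): unchanged after all layers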