Inference Process
Understanding Q, K, V in Attention
What Each Represents
- Query (Q): “What information do I need?” - Search request
- Key (K): “What information do I have?” - Advertisement/label
- Value (V): “Here’s the actual information” - Content payload
How They Work Together
- Q × Kᵀ: Compute raw attention scores, scaled by √d_k (who should attend to whom)
- Softmax: Normalize the scores into attention weights
- Attention × V: Weighted sum of values (what information gets mixed); a sketch follows this list
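A minimal sketch of this pipeline in NumPy, assuming a single attention head and toy dimensions; the function and variable names here are illustrative, not taken from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Toy single-head attention; Q, K, V each have shape [seq_len, d_k]."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # Q x K^T: raw attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # weighted sum of values

# 4 tokens, 8-dimensional head
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)         # shape (4, 8)
```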
Training Perspective
- W_q, W_k, W_v: Three learned transformation matrices
- Same input: Gets transformed three different ways for different purposes
- Learning: The model learns these matrices during training on the language modeling objective (a projection sketch follows this list)
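A hedged sketch of the three projections; here W_q, W_k, W_v are random stand-ins for the learned weights, and hidden_size / d_k are arbitrary toy values:

```python
import numpy as np

hidden_size, d_k = 16, 8
rng = np.random.default_rng(1)

# In a real model these matrices are learned during training;
# here they are random placeholders.
W_q = rng.normal(size=(hidden_size, d_k))
W_k = rng.normal(size=(hidden_size, d_k))
W_v = rng.normal(size=(hidden_size, d_k))

x = rng.normal(size=(hidden_size,))   # one token's hidden state

# Same input, transformed three different ways for three different purposes
Q, K, V = x @ W_q, x @ W_k, x @ W_v
```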
Example
Token "queen" input: [0.5, 0.8, 0.2, 0.9, ...]
After transformations:
Q = [0.000001, 1, 3, ...] # "I need person/musician info, not royal info"
K = [100, 2, 4, ...] # "I'm very relevant for royal queries"
V = [1, 2, 3, ...] # "I contain: royal=1, person=2, musician=3"
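Using only the three components shown above (toy numbers, not real model values), the Q·K dot product shows why the mismatch on the first, "royal" dimension is suppressed:

```python
# Toy attention score from the truncated vectors above
q = [0.000001, 1, 3]
k = [100, 2, 4]
score = sum(qi * ki for qi, ki in zip(q, k))  # 0.0001 + 2 + 12 ≈ 14.0001
# The royal dimension contributes almost nothing: Q barely asks for it,
# so K's strong royal advertisement (100) is scaled down to ~0.0001.
```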
LLM Inference Process
Two-Phase Approach
Phase 1: Prefilling
- Purpose: Process entire input prompt
- Method: All tokens processed simultaneously (parallel)
- Output: Build initial KV cache, generate first response token
- Speed: Fast due to parallelization
Phase 2: Decoding
- Purpose: Generate response tokens one by one
- Method: Sequential processing, append to KV cache
- Output: Complete response
- Speed: Slower due to its sequential nature (a two-phase sketch follows this list)
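A minimal, self-contained sketch of the two phases (toy model, random weights, greedy decoding). Prefilling is written as a loop for clarity; in a real engine the prompt tokens are processed in one parallel batch, and the cache is kept per layer and per head:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, hidden_size = 100, 16
embed = rng.normal(size=(vocab_size, hidden_size))
W_q, W_k, W_v = (rng.normal(size=(hidden_size, hidden_size)) for _ in range(3))
W_out = rng.normal(size=(hidden_size, vocab_size))

def step(x, k_cache, v_cache):
    """Run one token's hidden state against everything cached so far."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    k_cache.append(k)                              # append to the KV cache
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(hidden_size)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                   # softmax over cached positions
    return w @ V

def generate(prompt_ids, n_new):
    k_cache, v_cache = [], []
    # Phase 1 (prefilling): process the prompt, build the initial cache.
    for t in prompt_ids:
        h = step(embed[t], k_cache, v_cache)
    out = []
    # Phase 2 (decoding): one token at a time, extending the cache each step.
    for _ in range(n_new):
        next_id = int(np.argmax(h @ W_out))        # greedy pick of the next token
        out.append(next_id)
        h = step(embed[next_id], k_cache, v_cache)
    return out

print(generate([3, 7, 42], n_new=5))
```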
Complete Example
Input: "System: You are helpful. User: What is the capital of France?"
Prefilling:
- Process all 17 input tokens at once
- Build KV cache: [17 × hidden_size]
- Generate first token: "The"
Decoding:
Time 1: Add "The" → Cache: [18 × hidden_size] → Generate "capital"
Time 2: Add "capital" → Cache: [19 × hidden_size] → Generate "of"
Time 3: Add "of" → Cache: [20 × hidden_size] → Generate "France"
...continues until an end-of-sequence token (or a length limit) completes the response
Key Insights Gained
- KV Cache is Essential: Enables efficient autoregressive generation
- Pruning is Nuanced: Different strategies (per-channel vs per-token) serve different purposes
- Output-Awareness is Smart: Considers both stored information and current needs
- Q,K,V Have Distinct Roles: Not just different values, but different purposes
- Inference Has Structure: Prefilling vs decoding phases optimize for different constraints
- Everything Connects: From training objectives to inference efficiency to pruning strategies
Practical Applications
- Memory Optimization: Pruning reduces KV cache size for long sequences (see the calculation after this list)
- Inference Acceleration: Smaller cache = faster attention computation
- Quality Preservation: Smart pruning maintains model performance
- Scalability: Enables processing of longer contexts within memory constraints
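A back-of-the-envelope calculation of the memory involved, with assumed example figures (32 layers, hidden size 4096, fp16) that are roughly 7B-model scale rather than tied to any specific architecture:

```python
# KV cache bytes = 2 (K and V) * layers * seq_len * hidden_size * bytes_per_value
layers, hidden_size, bytes_fp16 = 32, 4096, 2
seq_len = 32_000

cache_bytes = 2 * layers * seq_len * hidden_size * bytes_fp16
print(f"Full cache:   {cache_bytes / 1e9:.1f} GB")               # ~16.8 GB

keep_ratio = 0.25  # e.g. a pruning strategy that keeps 25% of cached tokens
print(f"Pruned cache: {cache_bytes * keep_ratio / 1e9:.1f} GB")  # ~4.2 GB
```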
This understanding provides the foundation for working with modern LLM optimization techniques and for reasoning about their trade-offs between efficiency and quality.