Inference Process


Understanding Q, K, V in Attention

What Each Represents

  • Query (Q): “What information do I need?” - Search request
  • Key (K): “What information do I have?” - Advertisement/label
  • Value (V): “Here’s the actual information” - Content payload

How They Work Together

  1. Q × Kᵀ: Compute attention scores (who should attend to whom)
  2. Softmax: Normalize the scaled scores into attention weights
  3. Attention × V: Weighted sum of values (what information gets mixed)
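
A minimal NumPy sketch of these three steps for a single attention head (shapes and random values are illustrative; masking and multi-head details are omitted):

import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # step 1: Q × Kᵀ, scaled by √d_k
    weights = softmax(scores)         # step 2: each row normalized to sum to 1
    return weights @ V                # step 3: weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))  # 5 tokens, head dim 64
out = attention(Q, K, V)             # (5, 64): one mixed vector per token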

Training Perspective

  • W_q, W_k, W_v: Three learned transformation matrices
  • Same input: Gets transformed three different ways for different purposes
  • Learning: The model learns these matrices while solving the language modeling task
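
A small sketch of the same idea: one token embedding is multiplied by three learned matrices to produce its query, key, and value (all sizes here are toy values, not taken from a real model):

import numpy as np
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
W_q = rng.normal(size=(d_model, d_k))   # learned: "what to look for"
W_k = rng.normal(size=(d_model, d_k))   # learned: "how to be found"
W_v = rng.normal(size=(d_model, d_k))   # learned: "what to hand over"
x = rng.normal(size=d_model)            # embedding of one token, e.g. "queen"
q, k, v = x @ W_q, x @ W_k, x @ W_v     # same input, three different projections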

Example

Token "queen" input: [0.5, 0.8, 0.2, 0.9, ...]

After transformations:
Q = [0.000001, 1, 3, ...] # "I need person/musician info, not royal info"
K = [100, 2, 4, ...]      # "I'm very relevant for royal queries"
V = [1, 2, 3, ...]        # "I contain: royal=1, person=2, musician=3"

LLM Inference Process

Two-Phase Approach

Phase 1: Prefilling

  • Purpose: Process entire input prompt
  • Method: All tokens processed simultaneously (parallel)
  • Output: Build initial KV cache, generate first response token
  • Speed: Fast due to parallelization

Phase 2: Decoding

  • Purpose: Generate response tokens one by one
  • Method: Sequential processing, append to KV cache
  • Output: Complete response
  • Speed: Slower due to sequential nature
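
A runnable toy sketch of both phases, using a single attention layer with random weights (the hidden size, token counts, and step count are arbitrary illustrations, not a real model):

import numpy as np

rng = np.random.default_rng(0)
d = 16                                          # toy hidden size
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(x, k_cache, v_cache):
    """Cached attention: x holds only the new token(s); K/V for past tokens are reused."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    past = 0 if k_cache is None else k_cache.shape[0]
    k_all = k if k_cache is None else np.concatenate([k_cache, k])
    v_all = v if v_cache is None else np.concatenate([v_cache, v])
    scores = q @ k_all.T / np.sqrt(d)
    # causal mask: a token may not attend to positions after its own
    t = x.shape[0]
    mask = np.arange(past + t)[None, :] > (past + np.arange(t))[:, None]
    scores[mask] = -np.inf
    return softmax(scores) @ v_all, k_all, v_all

# Phase 1: prefilling — every prompt token in one parallel pass
prompt = rng.normal(size=(17, d))               # stand-in for 17 embedded prompt tokens
out, k_cache, v_cache = attend(prompt, None, None)
print(k_cache.shape)                            # (17, 16): cache covers the whole prompt

# Phase 2: decoding — one token per step, appended to the existing cache
for step in range(3):
    new_token = out[-1:]                        # stand-in for the next token's embedding
    out, k_cache, v_cache = attend(new_token, k_cache, v_cache)
    print(k_cache.shape)                        # (18, 16), (19, 16), (20, 16)

The prefill pass fills the cache for all 17 stand-in prompt tokens at once; each decode step then computes Q, K, V only for the newest token and reuses everything already cached.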

Complete Example

Input: "System: You are helpful. User: What is the capital of France?"

Prefilling:
- Process all 17 input tokens at once
- Build KV cache: [17 × hidden_size]
- Generate first token: "The"

Decoding:
Time 1: Add "The" → Cache: [18 × hidden_size] → Generate "capital"
Time 2: Add "capital" → Cache: [19 × hidden_size] → Generate "of"
Time 3: Add "of" → Cache: [20 × hidden_size] → Generate "France"
...continue until complete response
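
One simplification in the trace above: a real cache keeps separate K and V tensors for every layer, usually split across attention heads (exact layouts vary by implementation). A sketch of that bookkeeping with toy dimensions:

import numpy as np
num_layers, num_heads, head_dim = 2, 4, 8              # toy values, not a real model
kv_cache = [  # per layer: (K, V), each [n_tokens, num_heads, head_dim]
    (np.zeros((17, num_heads, head_dim)), np.zeros((17, num_heads, head_dim)))
    for _ in range(num_layers)
]
# one decode step appends one row to K and V in every layer
for i, (k, v) in enumerate(kv_cache):
    new_k = np.zeros((1, num_heads, head_dim))          # K for the token just generated
    new_v = np.zeros((1, num_heads, head_dim))
    kv_cache[i] = (np.concatenate([k, new_k]), np.concatenate([v, new_v]))
print(kv_cache[0][0].shape)                             # (18, 4, 8): 17 prompt tokens + "The"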

Key Insights Gained

  1. KV Cache is Essential: Enables efficient autoregressive generation
  2. Pruning is Nuanced: Different strategies (per-channel vs per-token) serve different purposes
  3. Output-Awareness is Smart: Considers both stored information and current needs (see the sketch after this list)
  4. Q,K,V Have Distinct Roles: Not just different values, but different purposes
  5. Inference Has Structure: Prefilling vs decoding phases optimize for different constraints
  6. Everything Connects: From training objectives to inference efficiency to pruning strategies
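
To make points 2 and 3 concrete, here is a minimal sketch of per-token, output-aware pruning: score each cached token by the attention it receives from recent queries and keep only the top-scoring entries. This illustrates the general idea rather than any specific published method; the shapes and keep budget are arbitrary.

import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def prune_kv_cache(k_cache, v_cache, recent_q, keep):
    """Keep the `keep` cached tokens that recent queries attend to most."""
    d_k = k_cache.shape[-1]
    weights = softmax(recent_q @ k_cache.T / np.sqrt(d_k))  # [n_queries, n_cached]
    scores = weights.sum(axis=0)                            # importance per cached token
    kept = np.sort(np.argsort(scores)[-keep:])              # top-`keep`, original order
    return k_cache[kept], v_cache[kept]

rng = np.random.default_rng(0)
k_cache = rng.normal(size=(100, 64))      # 100 cached tokens, head dim 64
v_cache = rng.normal(size=(100, 64))
recent_q = rng.normal(size=(8, 64))       # queries from the 8 most recent positions
k_small, v_small = prune_kv_cache(k_cache, v_cache, recent_q, keep=32)
print(k_small.shape)                      # (32, 64)

A per-channel strategy would instead prune dimensions within each cached vector rather than evicting whole tokens.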

Practical Applications

  • Memory Optimization: Pruning reduces KV cache size for long sequences (see the estimate after this list)
  • Inference Acceleration: Smaller cache = faster attention computation
  • Quality Preservation: Smart pruning maintains model performance
  • Scalability: Enables processing of longer contexts within memory constraints
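
A back-of-the-envelope estimate of why the memory point matters, assuming dimensions typical of a 7B-parameter model and fp16 storage (exact numbers vary by architecture):

num_layers, num_heads, head_dim = 32, 32, 128    # roughly 7B-class dimensions
bytes_per_value = 2                              # fp16
seq_len = 4096

# K and V each store [seq_len, num_heads, head_dim] per layer
cache_bytes = 2 * num_layers * seq_len * num_heads * head_dim * bytes_per_value
print(cache_bytes / 2**30, "GiB per sequence")         # ~2 GiB
print(cache_bytes / seq_len / 2**10, "KiB per token")  # ~512 KiB

Halving the number of cached tokens (for example via per-token pruning) halves this footprint and shrinks the attention computation over the cache proportionally.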

Together, these points provide a foundation for working with modern LLM optimization techniques and reasoning about their trade-offs between efficiency and quality.