Inference Process


Understanding Q, K, V in Attention

What Each Represents

  • Query (Q): “What information do I need?” - Search request
  • Key (K): “What information do I have?” - Advertisement/label
  • Value (V): “Here’s the actual information” - Content payload

How They Work Together

  1. Q × Kᵀ: Compute attention scores (who should attend to whom)
  2. Softmax: Normalize the scaled scores into attention weights
  3. Attention × V: Weighted sum of values (what information gets mixed)
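
A minimal NumPy sketch of these three steps for a single attention head (shapes and random values are illustrative; masking and multi-head details are omitted):

import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # step 1: Q × Kᵀ, scaled by √d_k
    weights = softmax(scores)         # step 2: each row normalized to sum to 1
    return weights @ V                # step 3: weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))  # 5 tokens, head dim 64
out = attention(Q, K, V)             # (5, 64): one mixed vector per token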

Training Perspective

  • W_q, W_k, W_v: Three learned transformation matrices
  • Same input: Gets transformed three different ways for different purposes
  • Learning: The model learns these matrices while solving the language modeling task
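
A small sketch of the same idea: one token embedding is multiplied by three learned matrices to produce its query, key, and value (all sizes here are toy values, not taken from a real model):

import numpy as np
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
W_q = rng.normal(size=(d_model, d_k))   # learned: "what to look for"
W_k = rng.normal(size=(d_model, d_k))   # learned: "how to be found"
W_v = rng.normal(size=(d_model, d_k))   # learned: "what to hand over"
x = rng.normal(size=d_model)            # embedding of one token, e.g. "queen"
q, k, v = x @ W_q, x @ W_k, x @ W_v     # same input, three different projections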

Example

Token "queen" input: [0.5, 0.8, 0.2, 0.9, ...]

After transformations:
Q = [0.000001, 1, 3, ...] # "I need person/musician info, not royal info"
K = [100, 2, 4, ...]      # "I'm very relevant for royal queries"
V = [1, 2, 3, ...]        # "I contain: royal=1, person=2, musician=3"

LLM Inference Process

Two-Phase Approach

Phase 1: Prefilling

  • Purpose: Process entire input prompt
  • Method: All tokens processed simultaneously (parallel)
  • Output: Build initial KV cache, generate first response token
  • Speed: Fast due to parallelization

Phase 2: Decoding

  • Purpose: Generate response tokens one by one
  • Method: Sequential processing, append to KV cache
  • Output: Complete response
  • Speed: Slower due to sequential nature
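
A runnable toy sketch of both phases, using a single attention layer with random weights (the hidden size, token counts, and step count are arbitrary illustrations, not a real model):

import numpy as np

rng = np.random.default_rng(0)
d = 16                                          # toy hidden size
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(x, k_cache, v_cache):
    """Cached attention: x holds only the new token(s); K/V for past tokens are reused."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    past = 0 if k_cache is None else k_cache.shape[0]
    k_all = k if k_cache is None else np.concatenate([k_cache, k])
    v_all = v if v_cache is None else np.concatenate([v_cache, v])
    scores = q @ k_all.T / np.sqrt(d)
    # causal mask: a token may not attend to positions after its own
    t = x.shape[0]
    mask = np.arange(past + t)[None, :] > (past + np.arange(t))[:, None]
    scores[mask] = -np.inf
    return softmax(scores) @ v_all, k_all, v_all

# Phase 1: prefilling — every prompt token in one parallel pass
prompt = rng.normal(size=(17, d))               # stand-in for 17 embedded prompt tokens
out, k_cache, v_cache = attend(prompt, None, None)
print(k_cache.shape)                            # (17, 16): cache covers the whole prompt

# Phase 2: decoding — one token per step, appended to the existing cache
for step in range(3):
    new_token = out[-1:]                        # stand-in for the next token's embedding
    out, k_cache, v_cache = attend(new_token, k_cache, v_cache)
    print(k_cache.shape)                        # (18, 16), (19, 16), (20, 16)

The prefill pass fills the cache for all 17 stand-in prompt tokens at once; each decode step then computes Q, K, V only for the newest token and reuses everything already cached.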

Complete Example

Input: "System: You are helpful. User: What is the capital of France?"

Prefilling:
- Process all 17 input tokens at once
- Build KV cache: [17 × hidden_size]
- Generate first token: "The"

Decoding:
Time 1: Add "The" → Cache: [18 × hidden_size] → Generate "capital"
Time 2: Add "capital" → Cache: [19 × hidden_size] → Generate "of"
Time 3: Add "of" → Cache: [20 × hidden_size] → Generate "France"
...continue until complete response
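
One simplification in the trace above: a real cache keeps separate K and V tensors for every layer, usually split across attention heads (exact layouts vary by implementation). A sketch of that bookkeeping with toy dimensions:

import numpy as np
num_layers, num_heads, head_dim = 2, 4, 8              # toy values, not a real model
kv_cache = [  # per layer: (K, V), each [n_tokens, num_heads, head_dim]
    (np.zeros((17, num_heads, head_dim)), np.zeros((17, num_heads, head_dim)))
    for _ in range(num_layers)
]
# one decode step appends one row to K and V in every layer
for i, (k, v) in enumerate(kv_cache):
    new_k = np.zeros((1, num_heads, head_dim))          # K for the token just generated
    new_v = np.zeros((1, num_heads, head_dim))
    kv_cache[i] = (np.concatenate([k, new_k]), np.concatenate([v, new_v]))
print(kv_cache[0][0].shape)                             # (18, 4, 8): 17 prompt tokens + "The"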

Key Insights Gained

  1. KV Cache is Essential: Enables efficient autoregressive generation
  2. Pruning is Nuanced: Different strategies (per-channel vs per-token) serve different purposes
  3. Output-Awareness is Smart: Considers both stored information and current needs (see the sketch after this list)
  4. Q,K,V Have Distinct Roles: Not just different values, but different purposes
  5. Inference Has Structure: Prefilling vs decoding phases optimize for different constraints
  6. Everything Connects: From training objectives to inference efficiency to pruning strategies
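
To make points 2 and 3 concrete, here is a minimal sketch of per-token, output-aware pruning: score each cached token by the attention it receives from recent queries and keep only the top-scoring entries. This illustrates the general idea rather than any specific published method; the shapes and keep budget are arbitrary.

import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def prune_kv_cache(k_cache, v_cache, recent_q, keep):
    """Keep the `keep` cached tokens that recent queries attend to most."""
    d_k = k_cache.shape[-1]
    weights = softmax(recent_q @ k_cache.T / np.sqrt(d_k))  # [n_queries, n_cached]
    scores = weights.sum(axis=0)                            # importance per cached token
    kept = np.sort(np.argsort(scores)[-keep:])              # top-`keep`, original order
    return k_cache[kept], v_cache[kept]

rng = np.random.default_rng(0)
k_cache = rng.normal(size=(100, 64))      # 100 cached tokens, head dim 64
v_cache = rng.normal(size=(100, 64))
recent_q = rng.normal(size=(8, 64))       # queries from the 8 most recent positions
k_small, v_small = prune_kv_cache(k_cache, v_cache, recent_q, keep=32)
print(k_small.shape)                      # (32, 64)

A per-channel strategy would instead prune dimensions within each cached vector rather than evicting whole tokens.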

Practical Applications

  • Memory Optimization: Pruning reduces KV cache size for long sequences (see the estimate after this list)
  • Inference Acceleration: Smaller cache = faster attention computation
  • Quality Preservation: Smart pruning maintains model performance
  • Scalability: Enables processing of longer contexts within memory constraints
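
A back-of-the-envelope estimate of why the memory point matters, assuming dimensions typical of a 7B-parameter model and fp16 storage (exact numbers vary by architecture):

num_layers, num_heads, head_dim = 32, 32, 128    # roughly 7B-class dimensions
bytes_per_value = 2                              # fp16
seq_len = 4096

# K and V each store [seq_len, num_heads, head_dim] per layer
cache_bytes = 2 * num_layers * seq_len * num_heads * head_dim * bytes_per_value
print(cache_bytes / 2**30, "GiB per sequence")         # ~2 GiB
print(cache_bytes / seq_len / 2**10, "KiB per token")  # ~512 KiB

Halving the number of cached tokens (for example via per-token pruning) halves this footprint and shrinks the attention computation over the cache proportionally.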

Together, these points provide a foundation for working with modern LLM optimization techniques and reasoning about their trade-offs between efficiency and quality.