ML Quantization & Optimization Notes

Core Quantization Concepts

Microscaling

  • Definition: Quantization technique applying different scaling factors to small groups of values (2-4 elements)
  • Benefit: More fine-grained precision control vs. tensor-wide scaling
  • Use case: Better handling of outliers while maintaining efficiency (see the sketch below)
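
A minimal sketch of per-group scaling, assuming NumPy and a simple symmetric integer grid per group (group size, scale rule, and the example values are illustrative choices, not a fixed standard):

import numpy as np

def microscale_quantize(x, group_size=4, n_bits=4):
    """One scale per small group of values; x length must be divisible by group_size."""
    groups = x.reshape(-1, group_size)
    max_abs = np.abs(groups).max(axis=1, keepdims=True)            # per-group max magnitude
    scale = np.where(max_abs == 0, 1.0, max_abs) / (2 ** (n_bits - 1) - 1)
    q = np.clip(np.round(groups / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q.astype(np.int8), scale

x = np.array([0.01, 0.02, -0.015, 0.005, 3.0, -2.5, 1.0, 0.5])
q, scale = microscale_quantize(x)
x_hat = (q * scale).reshape(-1)   # the small first group keeps precision despite the large second group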

FP4 vs INT4

FP4 (4-bit Floating Point):

  • Structure: Sign + Exponent + Mantissa in 4 bits
  • Common formats: [1|2|1] (E2M1) or [1|3|0] (E3M0) bits for [sign|exponent|mantissa]
  • Non-uniform value spacing (denser near zero)
  • Better dynamic range and outlier handling

INT4 (4-bit Integer):

  • Fixed-point representation with uniform spacing
  • Simpler computation but prone to saturation
  • Limited dynamic range (comparison sketch below)
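
To make the spacing difference concrete, this sketch enumerates the representable values of both formats, assuming the E2M1 layout with an exponent bias of 1 (one common convention, not the only one):

# FP4 (E2M1, bias 1) representable values: non-uniform, denser near zero
fp4 = sorted({s * (m * 0.5 if e == 0 else (1 + 0.5 * m) * 2 ** (e - 1))
              for s in (1, -1) for e in range(4) for m in range(2)})
print(fp4)   # ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}: step size grows with magnitude

# INT4 representable values: 16 uniformly spaced integers
int4 = list(range(-8, 8))
print(int4)  # -8 ... 7: constant step of 1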

FP4 with Microscaling Architecture

Traditional: [S|EE|M] per value                    (4 bits each)
Microscaled: [EE] shared + [S|M][S|M][S|M][S|M]    (2 shared exponent bits + 2 bits per value)
  • Efficiency: (2 + 4×2) / 4 = 2.5 bits/value for groups of 4
  • Implementation: Shared exponent decoder feeding multiple sign/mantissa units (decode sketch below)
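
A decode sketch under the layout above, with a 2-bit shared exponent and one sign bit plus one mantissa bit per value; the bias and mantissa weighting here are assumptions for illustration:

def decode_group(shared_exp, packed, bias=1):
    """Decode 4 microscaled values; each entry of `packed` is (sign_bit, mantissa_bit)."""
    out = []
    for sign, mant in packed:
        magnitude = (1 + 0.5 * mant) * 2 ** (shared_exp - bias)
        out.append(-magnitude if sign else magnitude)
    return out

print(decode_group(2, [(0, 0), (0, 1), (1, 0), (1, 1)]))  # [2.0, 3.0, -2.0, -3.0]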

Floating Point Fundamentals

Exponent-Mantissa Structure

Value = (-1)^sign × (1 + mantissa) × 2^(exponent - bias)
  • Exponent: Provides dynamic range (wide magnitude coverage)
  • Mantissa: Provides precision (significant digits)
  • Benefit: Consistent relative precision across magnitudes (worked example below)
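
A worked instance of the formula, assuming a toy format with 2 exponent bits (bias 1) and a 2-bit mantissa:

# sign bit = 1, exponent field = 0b10 (= 2), mantissa field = 0b01 (fraction 0.25), bias = 1
sign, exponent, mantissa_fraction, bias = 1, 2, 0.25, 1
value = (-1) ** sign * (1 + mantissa_fraction) * 2 ** (exponent - bias)
print(value)  # -2.5  =  -(1.25) × 2^1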

Clipping

  • Definition: Constraining values to [min, max] range
  • Applications: Gradient clipping, activation clipping, quantization clipping
  • Purpose: Prevent overflow/saturation issues (see the example below)
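
For example, clipping activations to a calibrated range before symmetric int8 quantization; the threshold here is illustrative (in practice it might come from a percentile of calibration activations):

import numpy as np

x = np.array([-12.0, -0.3, 0.7, 4.2, 250.0])    # one extreme outlier
clip_max = 6.0                                  # e.g. chosen from calibration statistics
x_clipped = np.clip(x, -clip_max, clip_max)     # [-6.0, -0.3, 0.7, 4.2, 6.0]
scale = clip_max / 127                          # int8 scale uses the clipped range, not the outlier
q = np.round(x_clipped / scale).astype(np.int8)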

Fisher Information & Uncertainty

Fisher Information Matrix

Full Matrix:

F_ij = E[(∂log p(x|θ)/∂θᵢ)(∂log p(x|θ)/∂θⱼ)]
  • Measures parameter sensitivity and information content
  • Off-diagonal terms capture parameter interactions
  • O(n²) computation and storage complexity (see the sketch below)
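
An empirical sketch of both versions, assuming per-sample gradients of the log-likelihood are already stacked as rows of a matrix G (shape N × n_params; the random values stand in for real gradients):

import numpy as np

# G[n, i] = ∂log p(xₙ|θ)/∂θᵢ for calibration sample n (stand-in values for illustration)
G = np.random.randn(100, 5)
F_full = G.T @ G / G.shape[0]          # average of per-sample outer products, O(n²) storage
F_diag = (G ** 2).mean(axis=0)         # diagonal approximation, O(n) storage
print(np.allclose(np.diag(F_full), F_diag))  # True: same diagonal, off-diagonal interactions dropped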

Diagonal Approximation:

F_ii = E[(∂log p(x|θ)/∂θᵢ)²]
  • Assumes parameter independence
  • O(n) computation - much more tractable
  • Estimated by averaging squared gradients over calibration data

Why Squared Gradients Work

F_ii ≈ (1/N) Σₙ (∂log p(xₙ|θ)/∂θᵢ)²
  • Empirical approximation of the expectation over the data distribution
  • Higher values indicate more sensitive/informative parameters
  • Used to guide quantization precision allocation (PyTorch sketch below)
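
A sketch of this estimator in PyTorch, assuming an existing classification model and a calib_loader that yields one example at a time (so the squared batch gradient equals the per-sample squared gradient); this is the empirical-Fisher variant that uses the observed labels:

import torch

# Accumulate squared gradients of the log-likelihood over the calibration data
fisher_diag = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
model.eval()
for x, y in calib_loader:
    model.zero_grad()
    log_probs = torch.log_softmax(model(x), dim=-1)
    nll = torch.nn.functional.nll_loss(log_probs, y)   # -log p(y|x, θ)
    nll.backward()
    for name, p in model.named_parameters():
        if p.grad is not None:
            fisher_diag[name] += p.grad.detach() ** 2
for name in fisher_diag:
    fisher_diag[name] /= len(calib_loader)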

Uncertainty Quantification

  • Purpose: Measure confidence in model predictions
  • Method: Propagate parameter uncertainty into prediction uncertainty
  • Relationship: Var(θᵢ) ≈ (F⁻¹)ᵢᵢ (see the sketch below)
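
Under the diagonal approximation the inverse is elementwise, so a rough per-parameter variance is just the reciprocal of each diagonal entry; a sketch, ignoring the priors and damping that practical methods usually add:

import numpy as np

F_diag = np.array([50.0, 2.0, 0.01])   # per-parameter Fisher information (illustrative values)
param_var = 1.0 / (F_diag + 1e-8)      # Var(θᵢ) ≈ (F⁻¹)ᵢᵢ; small damping term for stability
print(param_var)                       # high information → low variance → sensitive parameter, keep more precision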

Calibration Set Selection

Purpose

  • Determine optimal quantization parameters (scales, zero-points, clipping thresholds)
  • Representative subset for post-training quantization

Selection Criteria

  • Size: 100-1000 samples (accuracy vs efficiency balance)
  • Representativeness: Must capture deployment data distribution
  • Diversity: Cover different modes, edge cases, typical examples
  • Stratification: Ensure all classes/categories represented

Selection Strategies

import random

# Random sampling
calib_set = random.sample(train_set, 1000)

# Stratified sampling (flatten the per-class samples into one list)
# total_samples, num_classes, grouped_by_class are assumed to come from the dataset
samples_per_class = total_samples // num_classes
calib_set = [x
             for class_data in grouped_by_class
             for x in random.sample(class_data, samples_per_class)]

# Clustering-based (extract_features, kmeans, select_representative are placeholders
# for model-specific feature extraction and clustering utilities)
features = extract_features(train_set)
clusters = kmeans(features, n_clusters=100)
calib_set = [select_representative(cluster) for cluster in clusters]

Best Practices

  • Use validation set (avoid test set leakage)
  • Monitor activation statistics during selection
  • Include domain-specific variations (lighting, vocabulary, etc.)
  • Sometimes create a separate “calibration split” during data prep (example below)
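
One way to carve out such a split during data prep, assuming scikit-learn is available and that val_inputs/val_labels are labeled validation arrays (names here are illustrative):

from sklearn.model_selection import train_test_split

# Hold out ~1000 stratified examples from the validation data as a calibration split
val_inputs_rest, calib_inputs, val_labels_rest, calib_labels = train_test_split(
    val_inputs, val_labels, test_size=1000, stratify=val_labels, random_state=0)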

Key Mathematical Notation

  • E[·]: Expectation with respect to the data distribution (estimated on the calibration set)
  • ∇log p(x|θ): Gradient of log-likelihood (computed via backpropagation)
  • F_ij: Fisher information between parameters i and j
  • θ: Model parameters