Neural Network Scaling Laws - Study Notes
Core Scaling Law Formula
L(X) = (X/X_c)^(-α_X)
Where:
- L = Loss (performance metric, lower = better)
- X = Scale factor (D for data, N for parameters, C for compute)
- X_c = Critical threshold (minimum scale where power laws apply)
- α_X = Scaling exponent (determines improvement rate)
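A minimal sketch of this formula in Python; the function name `power_law_loss` and the example inputs are illustrative placeholders, not values from any particular fit.

```python
def power_law_loss(x, x_c, alpha):
    """Power-law loss L(X) = (X / X_c)^(-alpha).

    x     : scale factor (data D, parameters N, or compute C)
    x_c   : critical threshold where power-law behavior starts
    alpha : scaling exponent for this dimension
    """
    return (x / x_c) ** (-alpha)

# Illustrative: with alpha = 0.095, scaling 10x and 100x past the
# threshold multiplies the loss by ~0.80 and ~0.65 respectively.
print(power_law_loss(10.0, 1.0, 0.095))   # ~0.804
print(power_law_loss(100.0, 1.0, 0.095))  # ~0.646
```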
Three Key Scaling Dimensions
1. Data Scaling: L(D) = (D/D_c)^(-α_D)
- D = Dataset size (training tokens/examples)
- D_c = Critical dataset size threshold
- α_D ≈ 0.095 for transformers
- Doubling data → loss × 2^(-0.095) ≈ 0.936, i.e. roughly a 6.4% reduction in loss
2. Parameter Scaling: L(N) = (N/N_c)^(-α_N)
- N = Number of model parameters
- N_c = Critical parameter count threshold
- α_N ≈ 0.076 for transformers
3. Compute Scaling: L(C) = (C/C_c)^(-α_C)
- C = Total compute (FLOPs)
- C_c = Critical compute threshold
- α_C varies by compute allocation
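To get a feel for what the exponents quoted above imply, the sketch below compares the relative loss after a 10x increase in data versus parameters. Only the α_D ≈ 0.095 and α_N ≈ 0.076 estimates from these notes are used; everything else is illustrative.

```python
ALPHA_D = 0.095  # data scaling exponent for transformers (from these notes)
ALPHA_N = 0.076  # parameter scaling exponent for transformers (from these notes)

for name, alpha in [("data", ALPHA_D), ("parameters", ALPHA_N)]:
    ratio = 10 ** (-alpha)  # relative loss after a 10x increase in this dimension
    print(f"10x more {name}: loss x {ratio:.3f} ({(1 - ratio) * 100:.1f}% lower)")

# Expected output (approximately):
# 10x more data: loss x 0.804 (19.6% lower)
# 10x more parameters: loss x 0.839 (16.1% lower)
```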
Chinchilla Optimal Scaling
N_opt ∝ D_opt: scale parameters and training tokens in equal proportion (empirically the optimum sits near ~20 training tokens per parameter)
For compute budget C:
- N_opt ∝ C^0.5 (optimal parameters)
- D_opt ∝ C^0.5 (optimal training tokens)
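As a sketch of how these proportionalities turn into an actual split, the function below assumes the common approximation C ≈ 6·N·D training FLOPs and a fixed ~20 tokens-per-parameter ratio; the constant 6, the 20:1 ratio, and the function name are rough assumptions for illustration, not a fitted scaling law.

```python
import math

def compute_optimal_split(compute_flops, flops_per_param_token=6.0,
                          tokens_per_param=20.0):
    """Sketch of a Chinchilla-style split of a compute budget.

    Assumes training cost C ≈ flops_per_param_token * N * D and a fixed
    tokens-per-parameter ratio; both constants are rough approximations.
    Returns (N_opt, D_opt), each proportional to C^0.5 as stated above.
    """
    # C = k * N * D with D = t * N  =>  N = sqrt(C / (k * t)), D = t * N
    n_opt = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# Example: a 1e23 FLOP training budget
n, d = compute_optimal_split(1e23)
print(f"N_opt ≈ {n:.2e} parameters, D_opt ≈ {d:.2e} tokens")
# Roughly 2.9e10 parameters and 5.8e11 tokens
```

Doubling the compute budget multiplies both N_opt and D_opt by √2 ≈ 1.41, consistent with the C^0.5 exponents above.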
Key Insights
Diminishing Returns
- Small exponents (α < 0.1) mean significant scaling needed for major improvements
- 10x increase in scale → loss falls by a factor of only ~1.2-1.25 (10^0.076 ≈ 1.19, 10^0.095 ≈ 1.24)
Critical Thresholds
- X_c exists because very small scales don’t follow power laws
- Below threshold = too much noise, above threshold = predictable scaling
Trade-offs
- Data scaling: Doubling data cuts loss by ~6.4% (α_D ≈ 0.095) but roughly doubles training compute; inference cost unchanged
- Parameter scaling: Doubling parameters cuts loss by ~5.1% (α_N ≈ 0.076) and raises both training and inference cost
- Optimal allocation: Balance parameters and data, scaling both with compute
Practical Example
Doubling Training Data:
- Loss improvement: loss falls by a factor of 2^0.095 ≈ 1.068 (new loss ≈ 93.6% of old, a ~6.4% reduction)
- Training time: 2x longer
- Inference time: Unchanged
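A quick arithmetic check of those numbers; only the α_D ≈ 0.095 exponent from these notes is assumed.

```python
alpha_d = 0.095  # data scaling exponent for transformers (from these notes)

improvement_factor = 2 ** alpha_d    # old_loss / new_loss after doubling data
new_loss_fraction = 2 ** (-alpha_d)  # new_loss / old_loss after doubling data

print(f"Loss falls by a factor of {improvement_factor:.3f}")  # ~1.068
print(f"New loss is {new_loss_fraction:.3f} of the old loss, "
      f"a {(1 - new_loss_fraction) * 100:.1f}% reduction")    # ~0.936, ~6.4%
```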
Why Scaling Laws Matter
- Predictable performance across orders of magnitude
- Resource allocation guidance (don’t overtrain small models)
- ROI planning for compute investments
- Architecture comparison via scaling exponents