Neural Network Scaling Laws - Study Notes
Core Scaling Law Formula
L(X) = (X/X_c)^(-α_X)
Where:
- L = Loss (performance metric, lower = better)
- X = Scale factor (D for data, N for parameters, C for compute)
- X_c = Critical threshold (minimum scale where power laws apply)
- α_X = Scaling exponent (determines improvement rate)
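A minimal sketch of this formula in Python; the function name `power_law_loss` and the example inputs are illustrative placeholders, not values from any particular fit.

```python
def power_law_loss(x, x_c, alpha):
    """Power-law loss L(X) = (X / X_c)^(-alpha).

    x     : scale factor (data D, parameters N, or compute C)
    x_c   : critical threshold where power-law behavior starts
    alpha : scaling exponent for this dimension
    """
    return (x / x_c) ** (-alpha)

# Illustrative: with alpha = 0.095, scaling 10x and 100x past the
# threshold multiplies the loss by ~0.80 and ~0.65 respectively.
print(power_law_loss(10.0, 1.0, 0.095))   # ~0.804
print(power_law_loss(100.0, 1.0, 0.095))  # ~0.646
```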
Three Key Scaling Dimensions
1. Data Scaling: L(D) = (D/D_c)^(-α_D)
- D = Dataset size (training tokens/examples)
- D_c = Critical dataset size threshold
- α_D ≈ 0.095 for transformers
- Doubling data → loss × 2^(-0.095) ≈ 0.936, i.e. roughly a 6.4% reduction in loss
2. Parameter Scaling: L(N) = (N/N_c)^(-α_N)
- N = Number of model parameters
- N_c = Critical parameter count threshold
- α_N ≈ 0.076 for transformers
3. Compute Scaling: L(C) = (C/C_c)^(-α_C)
- C = Total compute (FLOPs)
- C_c = Critical compute threshold
- α_C varies by compute allocation
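To get a feel for what the exponents quoted above imply, the sketch below compares the relative loss after a 10x increase in data versus parameters. Only the α_D ≈ 0.095 and α_N ≈ 0.076 estimates from these notes are used; everything else is illustrative.

```python
ALPHA_D = 0.095  # data scaling exponent for transformers (from these notes)
ALPHA_N = 0.076  # parameter scaling exponent for transformers (from these notes)

for name, alpha in [("data", ALPHA_D), ("parameters", ALPHA_N)]:
    ratio = 10 ** (-alpha)  # relative loss after a 10x increase in this dimension
    print(f"10x more {name}: loss x {ratio:.3f} ({(1 - ratio) * 100:.1f}% lower)")

# Expected output (approximately):
# 10x more data: loss x 0.804 (19.6% lower)
# 10x more parameters: loss x 0.839 (16.1% lower)
```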
Chinchilla Optimal Scaling
N_opt ∝ D_opt: scale parameters and training tokens in equal proportion (empirically the optimum sits near ~20 training tokens per parameter)
For compute budget C:
- N_opt ∝ C^0.5 (optimal parameters)
- D_opt ∝ C^0.5 (optimal training tokens)
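As a sketch of how these proportionalities turn into an actual split, the function below assumes the common approximation C ≈ 6·N·D training FLOPs and a fixed ~20 tokens-per-parameter ratio; the constant 6, the 20:1 ratio, and the function name are rough assumptions for illustration, not a fitted scaling law.

```python
import math

def compute_optimal_split(compute_flops, flops_per_param_token=6.0,
                          tokens_per_param=20.0):
    """Sketch of a Chinchilla-style split of a compute budget.

    Assumes training cost C ≈ flops_per_param_token * N * D and a fixed
    tokens-per-parameter ratio; both constants are rough approximations.
    Returns (N_opt, D_opt), each proportional to C^0.5 as stated above.
    """
    # C = k * N * D with D = t * N  =>  N = sqrt(C / (k * t)), D = t * N
    n_opt = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

# Example: a 1e23 FLOP training budget
n, d = compute_optimal_split(1e23)
print(f"N_opt ≈ {n:.2e} parameters, D_opt ≈ {d:.2e} tokens")
# Roughly 2.9e10 parameters and 5.8e11 tokens
```

Doubling the compute budget multiplies both N_opt and D_opt by √2 ≈ 1.41, consistent with the C^0.5 exponents above.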
Key Insights
Diminishing Returns
- Small exponents (α < 0.1) mean significant scaling needed for major improvements
- 10x increase in scale → loss falls by a factor of only ~1.2-1.25 (10^0.076 ≈ 1.19, 10^0.095 ≈ 1.24)
Critical Thresholds
- X_c exists because very small scales don’t follow power laws
- Below threshold = too much noise, above threshold = predictable scaling
Trade-offs
- Data scaling: Doubling data cuts loss by ~6.4% (α_D ≈ 0.095) but roughly doubles training compute; inference cost unchanged
- Parameter scaling: Doubling parameters cuts loss by ~5.1% (α_N ≈ 0.076) and raises both training and inference cost
- Optimal allocation: Balance parameters and data, scaling both with compute
Practical Example
Doubling Training Data:
- Loss improvement: loss falls by a factor of 2^0.095 ≈ 1.068 (new loss ≈ 93.6% of old, a ~6.4% reduction)
- Training time: 2x longer
- Inference time: Unchanged
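A quick arithmetic check of those numbers; only the α_D ≈ 0.095 exponent from these notes is assumed.

```python
alpha_d = 0.095  # data scaling exponent for transformers (from these notes)

improvement_factor = 2 ** alpha_d    # old_loss / new_loss after doubling data
new_loss_fraction = 2 ** (-alpha_d)  # new_loss / old_loss after doubling data

print(f"Loss falls by a factor of {improvement_factor:.3f}")  # ~1.068
print(f"New loss is {new_loss_fraction:.3f} of the old loss, "
      f"a {(1 - new_loss_fraction) * 100:.1f}% reduction")    # ~0.936, ~6.4%
```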
Why Scaling Laws Matter
- Predictable performance across orders of magnitude
- Resource allocation guidance (don’t overtrain small models)
- ROI planning for compute investments
- Architecture comparison via scaling exponents