Sparsity
How Sparsity Is Used for Acceleration
When a multiplication involves zero (or values small enough to be treated as zero), the computation can be skipped entirely.
If 90% of the values are zero, then in theory only 10% of the operations need to be computed (a rough sketch after the list below illustrates this).
Hardware such as GPUs supports this kind of acceleration natively.
This support can:
- Reduce memory bandwidth (zero values are never loaded).
- Perform fewer arithmetic operations (some multiplications and additions are skipped).
- Lower energy consumption (a consequence of the two optimizations above).
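As a rough illustration of the idea (plain NumPy, not how real GPU kernels are written; the 90%-zero vector here is made-up example data), the sketch below skips the multiply-adds for zero entries of a dot product:

import numpy as np

rng = np.random.default_rng(0)

# A hypothetical input that is ~90% zeros
x = rng.standard_normal(1000)
x[rng.random(1000) < 0.9] = 0.0
w = rng.standard_normal(1000)

# Dense: 1000 multiply-adds regardless of how many entries are zero
dense = np.dot(w, x)

# Sparse: only touch the ~100 nonzero positions (skip loads and multiplies)
nz = np.flatnonzero(x)
sparse = np.dot(w[nz], x[nz])

print(len(nz), np.allclose(dense, sparse))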
Three Types of Sparsity in LLMs
Weight Sparsity: Zeros in the model parameters. These weights are pruned during training or post-training.
Activation Sparsity: Zeros in the intermediate activations during the forward pass, typically from ReLU-like functions that output zero for negative inputs. This sparsity is input-dependent and changes dynamically with the data being processed (a small measurement sketch follows these three definitions).
Attention Sparsity: Zeros or near-zeros in the attention weight matrices. Many attention heads focus on only a subset of tokens, creating natural sparsity patterns. This is also input-dependent and varies across different sequences.
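To make the input-dependent nature of activation sparsity concrete, here is a minimal sketch (random made-up inputs, not real model activations) that measures how many zeros a ReLU produces for two different inputs:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)

# Two hypothetical pre-activation tensors from two different inputs
h1 = rng.standard_normal((4, 1024))
h2 = rng.standard_normal((4, 1024)) - 0.5   # shifted input -> more negatives

for name, h in [("input 1", h1), ("input 2", h2)]:
    zero_fraction = np.mean(relu(h) == 0.0)
    print(f"{name}: {zero_fraction:.0%} of activations are zero")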
Random (Unstructured) and Structured Patterns
Random (Unstructured) Sparsity: zero values appear scattered throughout the tensor without any particular pattern. Drawbacks:
- Irregular memory access patterns
- Hard to vectorize operations
- Requires complex indexing schemes
Structured Pattern: zero values follow regular patterns, such as entire rows, columns, or blocks being zero (or N:M patterns like 2:4); a storage-layout sketch after the advantages list below makes the difference concrete.
Advantages:
- Regular memory access patterns
- Easier to map to hardware parallelism
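As a rough sketch of the storage difference (a simplified layout, not the actual format any particular hardware uses): unstructured sparsity needs an arbitrary column index per nonzero, while 2:4 structured sparsity stores exactly 2 values plus 2 small in-group positions per group of 4, giving a regular, predictable access pattern:

import numpy as np

row = np.array([0.0, 1.3, 0.0, 0.0, -0.7, 0.0, 2.1, 0.0])

# Unstructured: each nonzero carries an arbitrary column index
cols = np.flatnonzero(row)              # irregular positions, e.g. [1, 4, 6]
vals = row[cols]

# 2:4 structured: per group of 4, store exactly 2 values plus their
# in-group positions (each representable with 2 bits)
groups = row.reshape(-1, 4)
keep = np.sort(np.argsort(np.abs(groups), axis=1)[:, 2:], axis=1)
packed = np.take_along_axis(groups, keep, axis=1)   # shape (n_groups, 2)

print(cols, vals)
print(keep, packed)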
How Structured Patterns Arise
Training-Time Methods: Structured Pruning During Training:
- Block-wise pruning: remove entire blocks of weights
- Channel pruning: Remove entire channels/filters in convolutional layers, or entire attention heads
- N:M sparsity: Of every M consecutive weights, at most N are kept nonzero and the rest are forced to zero (e.g. 2:4 keeps 2 out of every 4)
Regularization with structural constraints: add penalty terms to the loss function that encourage structured patterns (a minimal Group LASSO sketch follows this list):
- Group LASSO regularization to zero out entire groups of weights
- Structured dropout that follows the desired sparsity pattern
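A minimal sketch of the Group LASSO idea (plain NumPy; treating each output channel's row as one group is an assumption made here for illustration): the penalty sums the L2 norm of each group, which pushes whole groups, rather than individual weights, toward zero:

import numpy as np

def group_lasso_penalty(weight, lam=1e-3):
    # weight: (out_channels, in_features); each row (one channel) is a group.
    # Penalizing per-group L2 norms drives entire rows to zero, which is a
    # structured (channel-level) sparsity pattern.
    group_norms = np.linalg.norm(weight, axis=1)
    return lam * np.sum(group_norms)

# During training this term would be added to the task loss:
# loss = task_loss + group_lasso_penalty(W)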
Post-Training Methods: Structured Pruning: Take a dense pre-trained model and apply structured pruning:
- Magnitude-based: Within each block/group, keep only the largest-magnitude weights and zero the rest (a channel-level sketch follows this list)
- Gradient-based: Use gradient information to decide which structure to prune
- Fisher information: Use second-order information to make more informed pruning decisions
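For instance, magnitude-based structured pruning at the channel level can be sketched as scoring each channel by its L2 norm and zeroing the weakest ones (the 50% ratio and the weight layout below are arbitrary choices for illustration):

import numpy as np

def prune_channels_by_magnitude(weight, prune_ratio=0.5):
    # weight: (out_channels, in_features); score each output channel by its
    # L2 norm and zero out the lowest-scoring channels entirely.
    scores = np.linalg.norm(weight, axis=1)
    n_prune = int(len(scores) * prune_ratio)
    pruned = weight.copy()
    pruned[np.argsort(scores)[:n_prune]] = 0.0
    return pruned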
Knowledge Distillation: Train a sparse student model with structured constraints to mimic a dense teacher model.
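As a minimal sketch of the distillation objective (plain NumPy; the temperature, the 1e-9 stabilizer, and the alpha weighting are illustrative assumptions): the sparse student is trained to match the dense teacher's softened output distribution on top of the usual task loss.

import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) between temperature-softened distributions
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9)), axis=-1)
    return (T * T) * np.mean(kl)

# total_loss = task_loss + alpha * distillation_loss(student_logits, teacher_logits)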
# Code for 2:4 sparsity: how the pattern is derived
import numpy as np

def apply_2_4_sparsity(weights):
    original_shape = weights.shape
    # Reshape into groups of 4 consecutive weights
    groups = weights.reshape(-1, 4).copy()
    # In each group, find the indices of the 2 smallest-magnitude weights
    indices = np.argsort(np.abs(groups), axis=1)[:, :2]
    # Zero them out, keeping the 2 largest-magnitude weights per group
    np.put_along_axis(groups, indices, 0.0, axis=1)
    return groups.reshape(original_shape)
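A quick usage check (with a made-up 8x8 weight matrix) confirms that every group of 4 ends up with at least 2 zeros, i.e. roughly 50% overall sparsity:

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))

W_sparse = apply_2_4_sparsity(W)
groups = W_sparse.reshape(-1, 4)
print(np.all((groups == 0).sum(axis=1) >= 2))   # True: >= 2 zeros per group
print(np.mean(W_sparse == 0))                    # ~0.5 overall sparsity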