NVIDIA Blackwell UMMA Architecture Guide - Part One

Overview

This guide covers the fundamental concepts of NVIDIA’s Blackwell GPU architecture, focusing on the transition from Hopper’s WGMMA to Blackwell’s UMMA (Unified Matrix Multiply-Accumulate) instruction and the introduction of Tensor Memory (TMEM).

1. From Hopper WGMMA to Blackwell UMMA

WGMMA (Hopper Architecture)

  • Full Name: Warp Group Matrix Multiply-Accumulate
  • Nature: Asynchronous instruction for matrix operations on Tensor Cores
  • Launch Model: Multi-threaded (multiple threads coordinate to launch)
  • Benefits of Async: Enables overlap of computation with other work, better resource utilization

UMMA (Blackwell Architecture)

  • Full Name: Unified Matrix Multiply-Accumulate (CUTLASS terminology for tcgen05.mma)
  • Why tcgen05: Tensor Core Generation 5 (Blackwell = 5th gen Tensor Cores)
  • Launch Model: Single-threaded (only one thread launches the operation)
  • Operations Supported:
    • D = A × B + D (multiply-accumulate)
    • D = A × B (multiply only)

Key Evolution: TMA → UMMA Analogy

  • TMA (Tensor Memory Accelerator): Made data copying single-threaded and register-efficient
  • UMMA: Applies the same principles to matrix operations
  • Both follow the pattern: offload complexity from software to dedicated hardware

2. Tensor Memory (TMEM)

What is TMEM?

  • Definition: Dedicated on-chip memory for UMMA accumulation operations
  • Purpose: Fast storage for intermediate matrix computation results
  • Capacity: 128 lanes (fixed) × up to 512 columns of 32-bit cells

TMEM Allocation

// Allocation syntax (simplified; full modifier spelling in the sketch below)
tcgen05.alloc.b32 [dst], num_columns;

// Requirements:
// - Minimum 32 columns, maximum 512
// - Column count must be a power of 2 (32, 64, 128, 256, 512)
// - The TMEM base address is written to the shared-memory location [dst]
// - Must be explicitly freed with tcgen05.dealloc
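
Below is a minimal CUDA sketch of the allocate/use/deallocate lifecycle using inline PTX. The exact tcgen05 modifier spelling (.cta_group::1.sync.aligned.shared::cta) follows recent PTX ISA revisions and should be treated as an assumption; the column count of 64 is an arbitrary example.

#include <cuda_runtime.h>
#include <cstdint>

__global__ void tmem_lifecycle_example() {
    // tcgen05.alloc writes the TMEM base address into this shared-memory slot,
    // which is how the other warps in the CTA learn where the allocation lives.
    __shared__ uint32_t tmem_addr;

    // A single warp performs the allocation (here: warp 0).
    if (threadIdx.x / 32 == 0) {
        uint32_t smem_ptr = (uint32_t)__cvta_generic_to_shared(&tmem_addr);
        asm volatile(
            "tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], 64;"
            :: "r"(smem_ptr));
    }
    __syncthreads();

    // ... single-thread UMMA launches accumulate into tmem_addr here ...

    // Deallocation names the same address and column count.
    if (threadIdx.x / 32 == 0) {
        asm volatile(
            "tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, 64;"
            :: "r"(tmem_addr));
    }
}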

TMEM vs Other Memory Types

TMEM ≠ Shared Memory
├── TMEM: Dedicated tensor computation space
└── Shared Memory: Stores TMEM descriptors/addresses for coordination

Memory Access Restrictions

  • Per-Warp Access: Each warp can only access specific lanes
    • Warp 0: Lanes 0-31
    • Warp 1: Lanes 32-63
    • Warp 2: Lanes 64-95
    • Warp 3: Lanes 96-127
  • Implication: TMEM cannot be used for inter-warp data exchange (the fixed mapping is sketched below)
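
A tiny device helper makes the fixed warp-to-lane mapping concrete; tmem_lane_base is a hypothetical name for illustration, not a CUDA or CUTLASS API:

// Returns the first TMEM lane this thread's warp may touch.
// Warp 0 -> lane 0, warp 1 -> lane 32, warp 2 -> lane 64, warp 3 -> lane 96.
__device__ int tmem_lane_base() {
    int warp_in_group = (threadIdx.x / 32) % 4; // warp index within the warpgroup
    return warp_in_group * 32;
}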

3. UMMA Operation Details

Matrix Operation Capabilities

  • Supported Shapes:
    • 64 × N × 16 (N = multiple of 8, max 256)
    • 128 × N × 16 (N = multiple of 16, max 256)
  • Largest Atom: 128 × 256 × 16 (twice the size of the largest WGMMA; a shape check is sketched after this list)
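
The shape rules above are easy to encode in a host-side check. This is an illustrative helper (is_valid_umma_shape is not a library function), covering only the 16-bit-input ×16 shapes listed here:

#include <cstdio>

// True if M x N x K names a legal dense UMMA atom per the rules above.
bool is_valid_umma_shape(int M, int N, int K) {
    if (K != 16) return false;                                // K fixed at 16
    if (M == 64)  return N >= 8  && N <= 256 && N % 8  == 0;  // 64 x N x 16
    if (M == 128) return N >= 16 && N <= 256 && N % 16 == 0;  // 128 x N x 16
    return false;
}

int main() {
    printf("%d\n", is_valid_umma_shape(128, 256, 16)); // 1: the largest atom
    printf("%d\n", is_valid_umma_shape(64, 12, 16));   // 0: 12 is not a multiple of 8
}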

Performance Optimization

  • Pipeline Efficiency: Even the largest UMMA accumulator (128 × 256) occupies only half of TMEM’s 512 columns
  • Benefit: Multiple UMMA operations can pipeline without performance loss
  • Result: Continuous execution, maximum throughput

Input Descriptors

  • Matrix Descriptors: 64-bit values containing address, layout, and swizzling info
  • Special Case: If matrix A comes from TMEM, descriptor is replaced by simple TMEM address
  • Instruction Descriptor: 32-bit metadata containing:
    • Data type and sparsity information
    • Transpose/negate flags for A and B matrices
    • Accumulation control (enable-input-d), as shown in the sketch below
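
A hedged sketch of how the three descriptors meet at the launch site. The tcgen05.mma spelling and the predicate idiom mirror what recent PTX ISA revisions describe, but treat the exact modifiers as assumptions; umma_f16 is a hypothetical wrapper name:

#include <cstdint>

// d_tmem: TMEM address of the accumulator D
// a_desc, b_desc: 64-bit matrix descriptors (a plain TMEM address if A is in TMEM)
// i_desc: 32-bit instruction descriptor (dtype, sparsity, transpose/negate flags)
// accumulate: the enable-input-d predicate (false zeroes D first)
__device__ void umma_f16(uint32_t d_tmem, uint64_t a_desc, uint64_t b_desc,
                         uint32_t i_desc, bool accumulate) {
    asm volatile(
        "{\n\t"
        ".reg .pred p;\n\t"
        "setp.ne.b32 p, %4, 0;\n\t"
        "tcgen05.mma.cta_group::1.kind::f16 [%0], %1, %2, %3, p;\n\t"
        "}\n"
        :: "r"(d_tmem), "l"(a_desc), "l"(b_desc), "r"(i_desc),
           "r"((uint32_t)accumulate));
}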

4. Key Features and Capabilities

Data Layout and Swizzling

  • Swizzling: Data rearrangement to optimize hardware access patterns
  • Purpose: Avoid memory bank conflicts, enable coalesced access
  • Expected Layout: K-major format in shared memory
  • Hardware Transpose: “Free” transpose during memory read (no computation cost)

Advanced Features

  1. Sparsity Support: Hardware optimization for matrices with many zeros
  2. Transpose/Negate: Built-in matrix transformations during operation
  3. Accumulation Control:
    • Zero out: D = A × B (fresh start)
    • Accumulate: D = A × B + D (add to existing; see the mainloop sketch below)
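
The two accumulation modes combine into the standard mainloop idiom: clear the accumulator on the first K-tile, accumulate on the rest. A sketch reusing the hypothetical umma_f16 wrapper from above (descriptor arrays assumed prepared):

// One thread drives the whole K-loop.
for (int k_tile = 0; k_tile < num_k_tiles; ++k_tile) {
    bool accumulate = (k_tile > 0); // first tile: D = A*B; later tiles: D = A*B + D
    umma_f16(d_tmem, a_descs[k_tile], b_descs[k_tile], i_desc, accumulate);
}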

CTA Pairs and Multi-SM Coordination

  • CTA Pair: Two adjacent CTAs within an SM cluster working together
  • Launch Model: Even with CTA pairs, only one thread in one CTA launches UMMA
  • Hardware Coordination: Automatic coordination between CTAs (cluster launch shown below)
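
CTA pairs presuppose that the two CTAs land in the same SM cluster. The cluster itself is requested with the standard CUDA launch-attribute API (available since Hopper); this sketch shows only the launch mechanics, with umma_kernel as a placeholder:

#include <cuda_runtime.h>

__global__ void umma_kernel() { /* ... UMMA work ... */ }

int main() {
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3(128);
    cfg.blockDim = dim3(128);

    // Ask for clusters of 2 CTAs so the hardware can treat them as a CTA pair.
    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 2;
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    cudaLaunchKernelEx(&cfg, umma_kernel);
    cudaDeviceSynchronize();
}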

5. Memory Movement Operations

TMEM Data Flow

Data IN:   tcgen05.mma   results land directly in TMEM
Data OUT:  tcgen05.ld    TMEM → RMEM (registers)
Manual:    tcgen05.cp    SMEM → TMEM
           tcgen05.st    RMEM → TMEM
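
A sketch of the TMEM → register path. The 32x32b.x1 shape loads one 32-bit element per lane across the warp's 32 lanes; the modifier spelling is taken from recent PTX ISA revisions and should be treated as an assumption, and tmem_load_one is a hypothetical helper:

#include <cstdint>

// Every thread of the warp receives the 32-bit cell from its own TMEM lane
// at the column encoded in tmem_addr.
__device__ uint32_t tmem_load_one(uint32_t tmem_addr) {
    uint32_t r;
    asm volatile("tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];"
                 : "=r"(r) : "r"(tmem_addr));
    asm volatile("tcgen05.wait::ld.sync.aligned;"); // ld is async; wait before use
    return r;
}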

Memory Space Terminology

  • GMEM: Global Memory
  • SMEM: Shared Memory
  • TMEM: Tensor Memory
  • RMEM: Register Memory (registers)

6. Epilogue Processing

Definition

Epilogue: Post-processing operations after main matrix multiplication

  • Activation functions (ReLU, sigmoid)
  • Bias addition, scaling
  • Data type conversion
  • Storage to global memory

Warpgroup Requirement

  • Problem: Large UMMA results span entire TMEM (all 128 lanes)
  • Solution: Entire warpgroup (4 warps) needed for epilogue
  • Process:
    1. Each warp reads its ¼ of TMEM (32 lanes)
    2. Each warp processes its portion independently
    3. Each warp stores results to global memory (see the epilogue sketch below)
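
Putting the three steps together, a hedged epilogue sketch: each of the four warps drains its own 32-lane quarter of TMEM, applies ReLU, and writes its rows to global memory. The TMEM address arithmetic is simplified (tmem_warp_base is assumed to already encode the warp's lane slice), and tmem_load_one is the hypothetical helper from Section 5:

// One warpgroup (128 threads) runs this; row ownership never crosses warps.
__device__ void epilogue_relu(uint32_t tmem_warp_base, float* gmem,
                              int ldg, int n_cols) {
    int warp_in_group = (threadIdx.x / 32) % 4;
    int lane          = threadIdx.x % 32;
    int row           = warp_in_group * 32 + lane; // the TMEM lane this thread owns

    for (int col = 0; col < n_cols; ++col) {
        // Step 1: read this lane's accumulator at column `col` (simplified addressing)
        float acc = __uint_as_float(tmem_load_one(tmem_warp_base + col));
        // Step 2: independent per-warp post-processing (here: ReLU)
        acc = fmaxf(acc, 0.0f);
        // Step 3: each warp stores its quarter of the result tile
        gmem[row * ldg + col] = acc;
    }
}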

7. Programming Model Simplification

Before (WGMMA)

  • Multi-threaded coordination required
  • Complex register management across threads
  • Higher software complexity

After (UMMA)

  • Single-threaded launch
  • Hardware manages complexity
  • Simplified programming model
  • Register-efficient design

Next: Part Two Preview

The next part will cover:

  • 2-CTA UMMA operations
  • Advanced CUTLASS utilities
  • Detailed swizzling patterns
  • Performance optimization strategies