NVIDIA Blackwell UMMA Architecture Guide - Part One
Overview
This guide covers the fundamental concepts of NVIDIA’s Blackwell GPU architecture, focusing on the transition from Hopper’s WGMMA to Blackwell’s UMMA (Unified Matrix Multiply-Accumulate) instruction and the introduction of Tensor Memory (TMEM).
1. From Hopper WGMMA to Blackwell UMMA
WGMMA (Hopper Architecture)
- Full Name: Warp Group Matrix Multiply-Accumulate
- Nature: Asynchronous instruction for matrix operations on Tensor Cores
- Launch Model: Multi-threaded (multiple threads coordinate to launch)
- Benefits of Async: Computation can overlap with other work, improving resource utilization
UMMA (Blackwell Architecture)
- Full Name: Unified Matrix Multiply-Accumulate (CUTLASS terminology for tcgen05.mma)
- Why tcgen05: Tensor Core Generation 5 (Blackwell ships the 5th-generation Tensor Core)
- Launch Model: Single-threaded (only one thread launches the operation)
- Operations Supported:
  - D = A × B + D (multiply-accumulate)
  - D = A × B (multiply only)
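To pin down the semantics, here is a minimal scalar reference of the two forms, assuming row-major A (M×K) and B (K×N); it illustrates what UMMA computes, not how Tensor Cores compute it.

// Scalar reference for the two UMMA forms (illustrative only; the real
// operation runs on Tensor Cores, not in this loop).
__device__ void mma_reference(const float* A, const float* B, float* D,
                              int M, int N, int K, bool accumulate) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = accumulate ? D[m * N + n] : 0.0f;  // D = A×B + D vs. D = A×B
      for (int k = 0; k < K; ++k) {
        acc += A[m * K + k] * B[k * N + n];
      }
      D[m * N + n] = acc;
    }
  }
}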
Key Evolution: TMA → UMMA Analogy
- TMA (Tensor Memory Accelerator): Made data copying single-threaded and register-efficient
- UMMA: Applies the same principles to matrix operations
- Both follow the pattern: offload complexity from software to dedicated hardware
2. Tensor Memory (TMEM)
What is TMEM?
- Definition: Dedicated on-chip memory for UMMA accumulation operations
- Purpose: Fast storage for intermediate matrix computation results
- Capacity: 128 lanes (rows, fixed) × 512 columns of 32-bit cells; allocations claim a variable number of columns
TMEM Allocation
// Allocation syntax (simplified; see the PTX ISA for the full qualifiers)
tcgen05.alloc.b32 [dst_smem_addr], num_columns;
// Requirements:
// - Minimum 32 columns, maximum 512
// - Column count must be a power of 2 (32, 64, 128, 256, 512)
// - The allocated TMEM address is written to the shared-memory slot [dst_smem_addr]
// - Must be explicitly freed with tcgen05.dealloc
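As a concrete sketch, allocation could be driven from CUDA inline PTX along the following lines. The qualifiers (cta_group::1, sync.aligned, shared::cta) follow the PTX ISA description of tcgen05.alloc/tcgen05.dealloc but should be verified against your toolkit; the column count of 128 is an arbitrary choice.

#include <cstdint>

// Sketch: one warp allocates 128 TMEM columns; the hardware writes the
// resulting TMEM address into the shared-memory slot passed in brackets.
__global__ void tmem_alloc_sketch() {
  __shared__ uint32_t tmem_addr;            // tcgen05.alloc writes the address here
  if (threadIdx.x < 32) {                   // allocation is issued by a single warp
    uint32_t slot = static_cast<uint32_t>(__cvta_generic_to_shared(&tmem_addr));
    asm volatile(
        "tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;"
        :: "r"(slot), "r"(128u));
  }
  __syncthreads();                          // every warp may now read tmem_addr
  // ... issue UMMAs that accumulate into the allocated columns ...
  __syncthreads();
  if (threadIdx.x < 32) {                   // explicit deallocation is mandatory
    asm volatile(
        "tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;"
        :: "r"(tmem_addr), "r"(128u));
  }
}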
TMEM vs Other Memory Types
TMEM ≠ Shared Memory
├── TMEM: Dedicated tensor computation space
└── Shared Memory: Stores TMEM descriptors/addresses for coordination
Memory Access Restrictions
- Per-Warp Access: Each warp can only access specific lanes:
  - Warp 0: Lanes 0-31
  - Warp 1: Lanes 32-63
  - Warp 2: Lanes 64-95
  - Warp 3: Lanes 96-127
- Implication: TMEM cannot be used for inter-warp data exchange
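Because the partitioning is fixed, each thread can derive the one TMEM lane it may touch from its position alone; a trivial sketch:

// The lane partitioning above in code: warp w owns lanes [32w, 32w + 31],
// so thread t of a 128-thread warpgroup maps to exactly one TMEM lane.
__device__ int my_tmem_lane() {
  int warp_id = threadIdx.x / 32;   // 0..3 within the warpgroup
  int lane_id = threadIdx.x % 32;   // position within the warp
  return warp_id * 32 + lane_id;    // == threadIdx.x, but spelled out
}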
3. UMMA Operation Details
Matrix Operation Capabilities
- Supported Shapes:
  - 64 × N × 16 (N a multiple of 8, max 256)
  - 128 × N × 16 (N a multiple of 16, max 256)
- Largest Atom: 128 × 256 × 16 (twice the size of the largest WGMMA)
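These constraints can be encoded as a small compile-time predicate; valid_umma_shape below is a hypothetical helper, not a CUTLASS API.

// Compile-time check of the shape rules quoted above (hypothetical helper).
constexpr bool valid_umma_shape(int m, int n, int k) {
  if (k != 16 || n < 8 || n > 256) return false;
  if (m == 64)  return n % 8 == 0;    // 64 × N × 16: N a multiple of 8
  if (m == 128) return n % 16 == 0;   // 128 × N × 16: N a multiple of 16
  return false;
}
static_assert(valid_umma_shape(128, 256, 16), "largest UMMA atom");
static_assert(!valid_umma_shape(64, 264, 16), "N is capped at 256");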
Performance Optimization
- Pipeline Efficiency: Even the largest UMMA atom uses only 50% of TMEM (a 128 × 256 FP32 accumulator fills 256 of TMEM's 512 columns)
- Benefit: Two such accumulators fit at once, so successive UMMA operations can pipeline without performance loss
- Result: Continuous execution, maximum throughput
Input Descriptors
- Matrix Descriptors: 64-bit values containing address, layout, and swizzling info
- Special Case: If matrix A comes from TMEM, its descriptor is replaced by a plain TMEM address
- Instruction Descriptor: 32-bit metadata containing:
  - Data type and sparsity information
  - Transpose/negate flags for the A and B matrices
  - Accumulation control (enable-input-d)
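Putting the pieces together, a single-thread launch might look like the following inline-PTX sketch. It assumes the tcgen05.mma form from the PTX ISA with kind::f16 inputs; building a_desc, b_desc, and idesc is not shown, and the exact syntax should be verified against the PTX manual.

#include <cstdint>

// Sketch: one thread launches a UMMA for the whole CTA. enable-input-d is a
// predicate, so the boolean is materialized with setp before the mma.
__device__ void umma_launch_sketch(uint32_t d_tmem, uint64_t a_desc,
                                   uint64_t b_desc, uint32_t idesc,
                                   bool accumulate) {
  if (threadIdx.x == 0) {                // single-threaded launch model
    asm volatile(
        "{\n\t"
        ".reg .pred p;\n\t"
        "setp.ne.b32 p, %4, 0;\n\t"      // p = accumulate ? (D=A×B+D) : (D=A×B)
        "tcgen05.mma.cta_group::1.kind::f16 [%0], %1, %2, %3, p;\n\t"
        "}"
        :: "r"(d_tmem), "l"(a_desc), "l"(b_desc), "r"(idesc),
           "r"(static_cast<uint32_t>(accumulate)));
  }
}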
4. Key Features and Capabilities
Data Layout and Swizzling
- Swizzling: Data rearrangement to optimize hardware access patterns
- Purpose: Avoid memory bank conflicts, enable coalesced access
- Expected Layout: K-major format in shared memory
- Hardware Transpose: “Free” transpose during memory read (no computation cost)
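As a toy illustration of the idea (not Blackwell's actual patterns, which Part Two covers), a classic trick is to XOR the column index with low bits of the row so that consecutive rows hit different banks:

// Toy XOR swizzle: rows 0..7 each see a different permutation of an
// 8-column group, spreading accesses across shared-memory banks.
__device__ int toy_swizzle(int row, int col) {
  return col ^ (row & 7);
}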
Advanced Features
- Sparsity Support: Hardware optimization for matrices with many zeros
- Transpose/Negate: Built-in matrix transformations during operation
- Accumulation Control:
  - Zero out: D = A × B (fresh start)
  - Accumulate: D = A × B + D (add to existing)
CTA Pairs and Multi-SM Coordination
- CTA Pair: Two adjacent CTAs within an SM cluster working together
- Launch Model: Even with CTA pairs, only one thread in one CTA launches UMMA
- Hardware Coordination: Automatic coordination between CTAs
5. Memory Movement Operations
TMEM Data Flow
Data IN: UMMA operations → TMEM
Data OUT: tcgen05.ld → RMEM (registers)
Manual: tcgen05.cp (SMEM→TMEM), tcgen05.st (RMEM→TMEM)
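A sketch of the "data out" path, assuming the tcgen05.ld shape/num qualifiers from the PTX ISA (32x32b.x1 loads one 32-bit element per thread from the warp's 32 lanes) and FP32 accumulators; the companion tcgen05.wait::ld fence makes the loaded register safe to use.

#include <cstdint>

// Sketch: pull one 32-bit accumulator element per thread out of TMEM.
__device__ float tmem_load_one(uint32_t tmem_addr) {
  uint32_t r;
  asm volatile("tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];"
               : "=r"(r) : "r"(tmem_addr));
  asm volatile("tcgen05.wait::ld.sync.aligned;");  // fence before using r
  return __uint_as_float(r);                       // FP32 accumulator assumed
}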
Memory Space Terminology
- GMEM: Global Memory
- SMEM: Shared Memory
- TMEM: Tensor Memory
- RMEM: Register Memory (registers)
6. Epilogue Processing
Definition
Epilogue: Post-processing operations after main matrix multiplication
- Activation functions (ReLU, sigmoid)
- Bias addition, scaling
- Data type conversion
- Storage to global memory
Warpgroup Requirement
- Problem: Large UMMA results span entire TMEM (all 128 lanes)
- Solution: Entire warpgroup (4 warps) needed for epilogue
- Process:
  - Each warp reads its ¼ of TMEM (32 lanes)
  - Each warp processes its portion independently
  - Each warp stores results to global memory
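A sketch of the per-warp step, assuming each thread has already moved its accumulator elements from TMEM into registers (e.g. with tcgen05.ld); the bias, ReLU, and output layout are illustrative choices.

// Sketch: each warp independently post-processes and stores its quarter
// of the tile; vals holds this thread's accumulators, already out of TMEM.
__device__ void epilogue_sketch(const float* vals, int n_per_thread,
                                float bias, float* gmem_row) {
  int lane_id = threadIdx.x % 32;
  for (int i = 0; i < n_per_thread; ++i) {
    float v = vals[i] + bias;            // bias addition
    v = fmaxf(v, 0.0f);                  // ReLU activation
    gmem_row[i * 32 + lane_id] = v;      // coalesced store to GMEM
  }
}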
7. Programming Model Simplification
Before (WGMMA)
- Multi-threaded coordination required
- Complex register management across threads
- Higher software complexity
After (UMMA)
- Single-threaded launch
- Hardware manages complexity
- Simplified programming model
- Register-efficient design
Next: Part Two Preview
The next part will cover:
- 2-CTA UMMA operations
- Advanced CUTLASS utilities
- Detailed swizzling patterns
- Performance optimization strategies