ASM Backend¶
The ASM backend is Wave’s direct AMDGCN assembly code generation backend that compiles Wave kernels to native AMD GPU assembly instructions. This backend provides fine-grained control over GPU execution and enables advanced optimizations not possible with higher-level backends.
Overview¶
The ASM backend transforms Wave kernels through the following pipeline:
MLIR Generation: Wave kernels are first compiled to MLIR (Multi-Level Intermediate Representation)
MLIR Analysis: The MLIR is analyzed to extract kernel information, memory access patterns, and thread organization
Assembly Generation: AMDGCN assembly instructions are generated directly from the analyzed MLIR
Binary Compilation: The assembly is compiled to HSACO (Heterogeneous System Architecture Code Object) binaries using AMD’s toolchain
Architecture¶
The ASM backend consists of several key components:
The ASM backend follows a modular architecture with clear separation of concerns between MLIR analysis, assembly generation, and register management.
Key Components¶
- MLIR Walker (mlir_walker.py)
Analyzes MLIR operations and extracts kernel information including:
Function arguments and their types
Memory access patterns (loads/stores)
Thread ID operations and bounds
Affine expressions and their simplifications
Binding subspan operations for memory mapping
- ASM Emitter (asm_emitter.py)
Generates AMDGCN assembly instructions from kernel information:
Kernel preamble and metadata
Register allocation and management
Memory access instructions (buffer loads/stores)
Thread synchronization and control flow
Shader Resource Descriptor (SRD) setup
- Instruction Classes (instructions.py)
Provides structured representation of AMDGCN instructions:
Base instruction classes for different operation types
Specific instruction implementations (loads, stores, arithmetic)
Instruction builders for common patterns
Proper assembly formatting and syntax
- Register Allocator (register_allocator.py)
Manages GPU register allocation:
Scalar General Purpose Register (SGPR) allocation
Vector General Purpose Register (VGPR) allocation
Register conflict detection and resolution
Alignment requirements for vector operations
Architecture-specific granularities (CDNA2/3: VGPR=4, SGPR=8; RDNA2/3: VGPR=4, SGPR=16)
- Expression Emitter (expression_emitter.py)
Generic SymPy expression visitor that emits AMDGCN instructions with CSE:
Automatic Common Subexpression Elimination (CSE) with memoization
Expression canonicalization to maximize CSE hits (flatten, sort, fold constants)
Iterative postorder traversal of expression trees
Support for constants, symbols, and complex expressions
Optimized instruction selection (shifts for power-of-2, masks for modulo)
Const marker system for efficient register usage
Handles Add, Mul, Mod, floor division, and Pow operations
Structural expression keys for cache lookup
Error handling for unsupported expression types
- Utils (utils.py)
Provides utility functions for:
MLIR type parsing and analysis
Affine expression simplification using SymPy
Thread ID bound analysis
Expression-to-assembly conversion
SymPy expression building from MLIR indices
Byte offset calculation for memory addressing
- Handlers (handlers.py)
Operation-specific handlers for MLIR operations:
Memory allocation (memref.alloc) including LDS staging
Memory views (memref.view) with offset tracking
Load operations (vector.load) from global and LDS memory
Store operations (vector.store) to global and LDS memory
MFMA operations (amdgpu.mfma) with proper synchronization
LDS read/write operations (ds_read_b64, ds_write_b64)
Features¶
Direct Assembly Generation¶
The ASM backend generates native AMDGCN assembly instructions, providing:
Fine-grained control over GPU execution
Optimized memory access patterns with proper SRD setup
Efficient register usage through intelligent allocation
Thread synchronization with proper wait instructions
Advanced Optimizations¶
The backend implements several optimization techniques:
Common Subexpression Elimination (CSE): Automatically detects and reuses identical expressions across operations
Expression Canonicalization: Normalizes expressions (flatten, sort, fold) to maximize CSE effectiveness
Affine Expression Simplification: Uses SymPy to simplify complex index expressions
Thread ID Analysis: Automatically detects and optimizes thread ID usage patterns
Memory Access Optimization: Generates efficient buffer load/store sequences with base+offset addressing
Register Lifetime Management: Frees temporary VGPRs promptly while preserving cached expressions
Register Reuse: Minimizes register pressure through intelligent allocation and CSE
Power-of-2 Optimization: Uses bit shifts instead of multiplication for power-of-2 constants
Const Marker System: Avoids unnecessary register allocation for integer constants
Memory Management¶
The ASM backend handles memory operations through:
Shader Resource Descriptors (SRDs): Proper setup for buffer access
Vectorized Loads/Stores: Efficient 16-byte aligned memory operations
Address Calculation: Optimized offset computation using SymPy expressions
LDS (Local Data Share) Staging: Automatic staging through shared memory for improved performance
Synchronization: Proper wait instructions for memory consistency (vmcnt, lgkmcnt)
Expression-Based Addressing: Dynamic address calculation from MLIR affine maps
Hardware Accelerated Operations¶
The ASM backend provides native support for AMD GPU specialized instructions:
MFMA (Matrix Multiply-Accumulate): Hardware-accelerated matrix operations on CDNA architectures using VGPR-variant instructions
LDS Operations: Fast shared memory operations (ds_read_b64, ds_write_b64)
Multi-Wave Support: Automatic detection and handling of multi-wave workgroups with proper thread ID extraction
Multi-Workgroup Support: Dynamic detection of workgroup ID usage and conditional SGPR allocation
Architecture Support¶
The ASM backend supports multiple AMD GPU architectures with architecture-specific optimizations:
CDNA3 (gfx942): MI300 series with VGPR granularity of 4, SGPR granularity of 8
Dynamic Register Allocation¶
The backend features fully dynamic register allocation:
Automatic VGPR Allocation: Computes required VGPRs based on actual usage
Automatic SGPR Allocation: Computes required SGPRs based on actual usage
Conditional System Register Allocation: Dynamically detects workgroup ID and thread ID usage from MLIR
Granularity Alignment: Automatically rounds allocations to architecture-specific granularities
VGPR-Variant MFMA: Uses MFMA instructions that write directly to VGPRs, eliminating accumulator complexity
Usage¶
Basic Usage¶
To use the ASM backend, specify it in your Wave compilation options:
import wave_lang.kernel.lang as tkl
import wave_lang.kernel.wave as tkw
from wave_lang.kernel.wave.compile import WaveCompileOptions, wave_compile
# Define your kernel
@tkw.wave(constraints)
def my_kernel(a: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16],
b: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16]):
res = tkw.read(a)
tkw.write(res, b)
# Compile with ASM backend
options = WaveCompileOptions(
subs={M: 16, N: 16, ADDRESS_SPACE: tkl.AddressSpace.GLOBAL_MEMORY.value},
backend="asm",
wave_runtime=True,
compile_to_mlir=False
)
compiled_kernel = wave_compile(options, my_kernel)
Compilation Options¶
The ASM backend supports several compilation options:
``backend=”asm”``: Enables the ASM backend
``wave_runtime=True``: Uses Wave’s C++ runtime for execution
``compile_to_mlir=False``: Skips MLIR output, goes directly to assembly
``compile_to_asm=True``: Generates raw assembly text (for debugging)
Example: Simple Copy Kernel¶
Here’s a complete example of a copy kernel using the ASM backend:
import wave_lang.kernel.lang as tkl
import wave_lang.kernel.wave as tkw
from wave_lang.kernel.wave.compile import WaveCompileOptions, wave_compile
from wave_lang.kernel.wave.utils.torch_utils import device_randn, device_zeros
from torch.testing import assert_close
# Define symbolic dimensions
M = tkl.sym.M
N = tkl.sym.N
ADDRESS_SPACE = tkl.sym.ADDRESS_SPACE
# Hardware constraints
constraints = [
tkw.HardwareConstraint(
threads_per_wave=64,
vector_shapes={M: 16, N: 16}
),
tkw.WorkgroupConstraint(M, 16, 0),
tkw.WorkgroupConstraint(N, 16, 1),
tkw.WaveConstraint(M, 16),
tkw.WaveConstraint(N, 16)
]
# Define the kernel
@tkw.wave(constraints)
def copy_kernel(
a: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16],
b: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16]
):
"""Copy kernel that reads from input and writes to output."""
res = tkw.read(a)
tkw.write(res, b)
# Create test data
shape = (16, 16)
a = device_randn(shape, dtype=torch.float16)
b = device_zeros(shape, dtype=torch.float16)
# Compile with ASM backend
options = WaveCompileOptions(
subs={
M: shape[0],
N: shape[1],
ADDRESS_SPACE: tkl.AddressSpace.GLOBAL_MEMORY.value
},
canonicalize=True,
backend="asm",
wave_runtime=True,
compile_to_mlir=False
)
# Compile and execute
compiled_kernel = wave_compile(options, copy_kernel)
compiled_kernel(a, b)
# Verify results
assert_close(a, b)
print("Copy kernel executed successfully!")
Generated Assembly¶
The ASM backend generates optimized AMDGCN assembly. For the copy kernel above, it produces:
.amdgcn_target "amdgcn-amd-amdhsa--gfx942"
.text
.protected copy_kernel
.globl copy_kernel
.p2align 8
.type copy_kernel,@function
.section .rodata,#alloc
.p2align 6
.amdhsa_kernel copy_kernel
.amdhsa_user_sgpr_kernarg_segment_ptr 1
.amdhsa_accum_offset 12
.amdhsa_next_free_vgpr 12
.amdhsa_next_free_sgpr 16
.amdhsa_group_segment_fixed_size 0
.amdhsa_private_segment_fixed_size 0
.amdhsa_system_sgpr_workgroup_id_x 1
.amdhsa_system_sgpr_workgroup_id_y 1
.amdhsa_system_sgpr_workgroup_id_z 1
.amdhsa_system_vgpr_workitem_id 0
.amdhsa_float_denorm_mode_32 3
.amdhsa_float_denorm_mode_16_64 3
.end_amdhsa_kernel
.text
# SRD upper word (gfx9xx): data_format=4 => 0x20000
.set Srd127_96, 0x20000
copy_kernel:
s_load_dwordx2 s[2:3], s[0:1], 0x0
s_load_dwordx2 s[4:5], s[0:1], 0x8
s_waitcnt lgkmcnt(0)
# SRD for input buffer (arg0)
s_mov_b32 s8, s2
s_mov_b32 s9, s3
s_mov_b32 s10, 2048
s_mov_b32 s11, Srd127_96
# Compute lane ID and vector offset
v_mbcnt_lo_u32_b32 v0, -1, 0
v_mbcnt_hi_u32_b32 v0, -1, v0
v_lshlrev_b32 v2, 5, v0
# Load data from input buffer
buffer_load_dwordx4 v[4:7], v2, s[8:11], 0 offen offset:0
s_waitcnt vmcnt(0)
buffer_load_dwordx4 v[8:11], v2, s[8:11], 0 offen offset:16
s_waitcnt vmcnt(0)
# SRD for output buffer (arg1)
s_mov_b32 s12, s4
s_mov_b32 s13, s5
s_mov_b32 s14, 2048
s_mov_b32 s15, Srd127_96
# Store data to output buffer
buffer_store_dwordx4 v[4:7], v2, s[12:15], 0 offen offset:0
buffer_store_dwordx4 v[8:11], v2, s[12:15], 0 offen offset:16
s_endpgm
.amdgpu_metadata
---
amdhsa.version:
- 1
- 2
amdhsa.kernels:
- .name: copy_kernel
.symbol: 'copy_kernel.kd'
.language: OpenCL C
.language_version: [2, 0]
.args:
- .name: arg0_ptr
.size: 8
.offset: 0
.value_kind: global_buffer
.value_type: i8*
- .name: arg1_ptr
.size: 8
.offset: 8
.value_kind: global_buffer
.value_type: i8*
.group_segment_fixed_size: 0
.kernarg_segment_align: 8
.kernarg_segment_size: 16
.max_flat_workgroup_size: 64
.private_segment_fixed_size: 0
.sgpr_count: 16
.sgpr_spill_count: 0
.vgpr_count: 12
.vgpr_spill_count: 0
.wavefront_size: 64
...
.end_amdgpu_metadata
Key Features Demonstrated¶
This example demonstrates several key features of the ASM backend:
Automatic SRD Setup: The backend automatically generates Shader Resource Descriptor setup for buffer access
Efficient Memory Access: Uses 16-byte aligned buffer load/store instructions for optimal performance
Thread ID Management: Automatically computes lane IDs and vector offsets for memory access
Dynamic Register Allocation: Intelligently allocates SGPRs and VGPRs based on actual usage
Synchronization: Proper wait instructions ensure memory consistency
Example: Matrix Multiply with MFMA¶
The ASM backend supports hardware-accelerated matrix operations using MFMA instructions on CDNA architectures:
import wave_lang.kernel.lang as tkl
import wave_lang.kernel.wave as tkw
from wave_lang.kernel.wave.compile import WaveCompileOptions, wave_compile
# Define symbolic dimensions
M = tkl.sym.M
N = tkl.sym.N
K = tkl.sym.K
BLOCK_M = tkl.sym.BLOCK_M
BLOCK_N = tkl.sym.BLOCK_N
LOAD_ELEMS_PER_THREAD = tkl.sym.LOAD_ELEMS_PER_THREAD
STORE_ELEMS_PER_THREAD = tkl.sym.STORE_ELEMS_PER_THREAD
ADDRESS_SPACE = tkl.sym.ADDRESS_SPACE
# Hardware constraints for MFMA
constraints = [
tkw.HardwareConstraint(
threads_per_wave=64,
waves_per_block=(1, 1, 1),
vector_shapes={M: BLOCK_M, N: BLOCK_N}
),
tkw.WorkgroupConstraint(M, BLOCK_M, 0),
tkw.WorkgroupConstraint(N, BLOCK_N, 1),
tkw.WaveConstraint(M, BLOCK_M / 1),
tkw.WaveConstraint(N, BLOCK_N / 1)
]
# Define MMA kernel with LDS staging
@tkw.wave(constraints)
def mma_kernel(
a: tkl.Memory[M, K, ADDRESS_SPACE, tkl.f16],
b: tkl.Memory[N, K, ADDRESS_SPACE, tkl.f16],
c: tkl.Memory[M, N, GLOBAL_ADDRESS_SPACE, tkl.f32]
):
"""Matrix multiply kernel using MFMA with LDS staging."""
# Allocate LDS for staging
a_reg = tkw.read(a, elements_per_thread=LOAD_ELEMS_PER_THREAD)
b_reg = tkw.read(b, elements_per_thread=LOAD_ELEMS_PER_THREAD)
# Perform MFMA operation
acc = tkw.mma(a_reg, b_reg)
# Write results
tkw.write(acc, c, elements_per_thread=STORE_ELEMS_PER_THREAD)
# Compile with ASM backend
options = WaveCompileOptions(
subs={
M: 16, N: 16, K: 16,
BLOCK_M: 16, BLOCK_N: 16,
LOAD_ELEMS_PER_THREAD: 4,
STORE_ELEMS_PER_THREAD: 4,
ADDRESS_SPACE: tkl.AddressSpace.SHARED_MEMORY.value
},
canonicalize=True,
backend="asm",
wave_runtime=True,
compile_to_mlir=False
)
compiled_kernel = wave_compile(options, mma_kernel)
MFMA Key Features¶
The MFMA support includes:
LDS Staging: Automatically stages data through Local Data Share for optimal MFMA performance
VGPR-Variant MFMA: Uses MFMA instructions that write directly to VGPRs (not accumulators)
Synchronization: Inserts
s_waitcnt lgkmcnt(0)before MFMA to ensure LDS reads completeMulti-Wave Support: Correctly handles multiple waves per workgroup with thread ID extraction
Multi-Workgroup Support: Automatically detects and allocates workgroup ID system SGPRs as needed
Dynamic Metadata: Computes
vgpr_count,sgpr_count, andlds_sizedynamically
Advanced Features¶
Multi-Wave and Multi-Workgroup Support¶
The ASM backend automatically handles complex thread and workgroup configurations:
Multi-Wave Kernels
When a workgroup contains multiple waves (e.g., workgroup_size = [256, 4, 1]), the backend:
Detects Multi-Wave Configuration: Analyzes workgroup size from MLIR
translation_infoRequests System VGPRs: Sets
.amdhsa_system_vgpr_workitem_id 1to get flat thread ID inv0Extracts Thread IDs: Generates code to extract
tid_xandtid_yfrom flat ID:tid_x = v0 & 0x3ff(bits 0-9)tid_y = (v0 >> 10) & 0x3ff(bits 10-19)
Uses in Addressing: Thread IDs are used in affine expressions for memory access
Multi-Workgroup Kernels
When a kernel is dispatched across multiple workgroups (e.g., grid = [16, 16, 1]), the backend:
Detects Workgroup ID Usage: Scans MLIR for
gpu.block_idoperationsConditionally Requests System SGPRs: Only requests needed workgroup IDs:
.amdhsa_system_sgpr_workgroup_id_x 1ifgpu.block_id xis used.amdhsa_system_sgpr_workgroup_id_y 1ifgpu.block_id yis used.amdhsa_system_sgpr_workgroup_id_z 1ifgpu.block_id zis used
Allocates SGPRs: Places workgroup IDs at
s2,s3,s4(after kernarg pointer)Uses in Addressing: Workgroup IDs scale memory access for workgroup-local tiles
Example: 256x256 MMA with 4 workgroups (2x2 grid), single wave per workgroup:
Each workgroup processes a 64x64 tile
wgid_xandwgid_yare detected and allocatedGlobal memory addresses:
base + (wgid_x * 64 * 4) + (wgid_y * 64 * 256 * 4) + tid_x
Affine Expression Simplification¶
The ASM backend uses SymPy to simplify complex affine expressions:
# Complex index expression
@tkw.wave(constraints)
def complex_index_kernel(a: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16]):
# This expression gets simplified automatically
idx = tkl.affine.apply(lambda d0, s0: d0 - (d0 // 64) * 64, [tkl.tid.x])
res = tkw.read(a[idx, 0])
tkw.write(res, b[idx, 0])
The backend automatically simplifies d0 - (d0 // 64) * 64 to just d0 when d0 < 64.
Thread ID Analysis¶
The backend analyzes thread ID usage patterns:
@tkw.wave(constraints)
def thread_id_kernel(a: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16]):
# Backend automatically detects this is just tid.x
idx = tkl.affine.apply(lambda d0: d0, [tkl.tid.x])
res = tkw.read(a[idx, 0])
The backend recognizes that the affine expression simplifies to the thread ID and generates optimal code.
Expression Visitor System¶
The ASM backend uses a sophisticated expression visitor (ExprEmitter) to convert SymPy expressions to AMDGCN assembly with automatic Common Subexpression Elimination:
Supported Operations:
Constants and Symbols: Direct materialization into registers
Addition: Efficient left-to-right accumulation using
v_add_u32Multiplication: Power-of-2 uses
v_lshlrev_b32(shift), others usev_mul_lo_u32Modulo: Power-of-2 divisors use
v_and_b32(mask)Floor Division: Power-of-2 divisors use
v_lshrrev_b32(shift)Power of 2: Constant folding in expressions
Optimizations:
Common Subexpression Elimination: Automatically caches and reuses emitted expressions
Expression Canonicalization: Normalizes expressions to maximize cache hits (flatten Add/Mul, sort operands, fold constants)
Structural Expression Keys: Uses structural hashing for efficient cache lookup
Const/Dynamic Splitting: Separates constant offsets into instruction immediates
Lifetime Management: Frees temporary registers while preserving cached values
Const Marker System: Avoids allocating registers for intermediate constants
Iterative Postorder Traversal: Efficient expression tree walking
Register Reuse: Minimizes temporary register allocation through CSE
Instruction Selection: Chooses optimal instructions based on operand types
Example Expression Handling:
# Complex index expression: row*256 + col*4
# where row = tid_x // 16, col = tid_x % 16
# Used multiple times in load/store operations
# The backend automatically:
# 1. Canonicalizes the expression (flatten, sort, fold)
# 2. Simplifies floor division by 16 to right shift by 4
# 3. Simplifies modulo 16 to mask with 15
# 4. Optimizes multiplication by 256 to left shift by 8
# 5. Optimizes multiplication by 4 to left shift by 2
# 6. Caches the result for reuse across multiple operations
# 7. Accumulates results efficiently
# Generated assembly (simplified) - computed once, reused multiple times:
# v_lshrrev_b32 v2, 4, v1 # row = tid_x >> 4 (cached)
# v_lshlrev_b32 v2, 8, v2 # row * 256 (cached)
# v_and_b32 v3, 15, v1 # col = tid_x & 15 (cached)
# v_lshlrev_b32 v3, 2, v3 # col * 4 (cached)
# v_add_u32 v2, v2, v3 # row*256 + col*4 (cached in v2)
#
# # Subsequent uses of the same expression reuse v2:
# buffer_load_dwordx4 v[4:7], v2, s[8:11], 0 offen offset:0
# buffer_store_dwordx4 v[4:7], v2, s[12:15], 0 offen offset:0
# # No duplicate shift/mask instructions emitted!
Performance Considerations¶
The ASM backend is designed for performance-critical applications:
Direct Assembly: Eliminates intermediate compilation steps
Common Subexpression Elimination: Eliminates redundant computations by caching and reusing expressions
Optimized Instructions: Uses the most efficient AMDGCN instructions (shifts over multiplies, masks over divides)
Register Efficiency: Minimizes register pressure through intelligent allocation, CSE, and const marker system
Memory Bandwidth: Optimizes memory access patterns with base+offset addressing for maximum throughput
Lifetime Management: Frees temporary registers promptly to reduce pressure while preserving cached values
Hardware Acceleration: Leverages MFMA for matrix operations on CDNA architectures
LDS Staging: Automatically uses shared memory for improved memory access patterns
Dynamic Allocation: Computes exact register requirements for minimal resource usage
Architecture-Specific: Adapts to hardware granularities for optimal register allocation
VMEM Wait Optimization (Ticket-based vmcnt)¶
To hide vector memory (VMEM) latency and avoid over-synchronization, the backend uses a
ticket-based scheme to place the minimal required s_waitcnt vmcnt(N) right before
the first use of loaded data.
Each
buffer_load_*is assigned a monotonically increasing ticketT.When consuming data from ticket
Kand the last issued ticket isT, the backend emitss_waitcnt vmcnt(T - K). This allows newer loads to remain in flight while ensuring the data for ticketKis ready.Waits are placed at the first use (e.g., before an LDS write or a compute op), not after the load.
The threshold coalesces across nearby uses and is reset when new loads are issued.
Example (conceptual)¶
# Issue two loads back-to-back (both in-flight)
buffer_load_dwordx4 v[4:7], v2, s[8:11], 0 offen offset:0 # ticket 0
buffer_load_dwordx4 v[8:11], v2, s[8:11], 0 offen offset:16 # ticket 1
# Minimal wait before first use of the data from ticket 0
s_waitcnt vmcnt(1) # allow one newer load (ticket 1) to remain outstanding
buffer_store_dwordx4 v[4:7], v2, s[12:15], 0 offen offset:0
# Later when using data from ticket 1
s_waitcnt vmcnt(0)
buffer_store_dwordx4 v[8:11], v2, s[12:15], 0 offen offset:16
This placement hides memory latency behind independent address/index computation
and reduces the number of waits (and their strictness) compared to always using
vmcnt(0) immediately after each load.
Latency-Aware Scheduling (Database-driven)¶
Beyond VMEM wait placement, the ASM backend employs a database-driven, latency-aware scheduler to minimize stalls across VMEM, LGKM (LDS/scalar), VALU and MFMA pipelines.
Latency Database (
latency_db/gfx942.json): Versioned, per-architecture JSON containing instruction latencies and throughputs, plus hazard-specific distances (e.g.mfma_to_agpr_read). Each entry has a measurement source: -isa_manual: From AMD ISA documentation -llvm_codegen: From LLVM’s proven codegen patterns -measured/profiled: From microbenchmarks or profiling (future-ready)Latency Provider (
latency_provider.py): Query interface used by the emitter/handlers to retrieve latency/throughput and hazard distances. All values come from the JSON database; there are no hardcoded latencies in the code path.Scoreboard (
scoreboard.py): Tracks outstanding instructions and their readiness, detects RAW/WAW hazards, and recommends minimal waits/NOPs. It also integrates with the ticket-based VMEM/LGKM model.Always-On Integration (
asm_emitter.py): The emitter unconditionally initializes the Latency Provider and Scoreboard. Latency-aware scheduling is always enabled.
MFMA Scheduling Example¶
The backend uses VGPR-variant MFMA instructions that write results directly to VGPRs, simplifying the instruction sequence and eliminating the need for accumulator transfers. The latency-aware scheduler tracks MFMA execution to ensure proper timing before using results:
# Wait for LDS data before MFMA
s_waitcnt lgkmcnt(0)
# VGPR-variant MFMA writes directly to VGPRs
v_mfma_f32_16x16x16_f16 v[0:3], v[26:27], v[24:25], 0
# Results in v[0:3] are ready after MFMA latency (~8 cycles)
# Scheduler ensures proper timing before using v[0:3]
buffer_store_dwordx4 v[0:3], v2, s[12:15], 0 offen
Benefits¶
Minimal waits and NOPs derived from a single source of truth (the database)
Architecture-specific values without code changes
VGPR-variant MFMA simplifies instruction sequences and matches LLVM backend behavior
Dynamic workgroup/thread ID detection minimizes system register usage
Ready to adopt measured data in the future
Best Practices¶
For optimal performance with the ASM backend:
Use Power-of-2 Dimensions: Enables optimal shift/mask instruction generation
Align Memory Access: 16-byte aligned access patterns maximize bandwidth
Leverage LDS: Use shared memory (ADDRESS_SPACE.SHARED_MEMORY) for frequently accessed data
Enable MFMA: Use matrix operations on CDNA architectures for best performance
Minimize Register Pressure: Keep working sets small to maximize occupancy
Profile and Iterate: Use ROCm profiling tools to identify bottlenecks
Limitations¶
The ASM backend has some limitations:
AMD GPU Only: Only supports AMD GPUs with ROCm
Power-of-2 Constraints: Non-power-of-2 modulo and division operations are not supported
Expression Complexity: Some very complex affine expressions may not be supported
CDNA for MFMA: MFMA operations require CDNA2 or CDNA3 architecture (gfx90a, gfx940, gfx941, gfx942)
Dynamic Shapes: Requires concrete shape values at compile time
Troubleshooting¶
Common Issues and Solutions¶
Issue: ValueError: Mod divisor must be power-of-two
This occurs when using modulo with non-power-of-2 divisors:
# Bad: modulo by 3 not supported
idx = tid_x % 3 # ERROR
# Good: modulo by power-of-2
idx = tid_x % 16 # OK - uses v_and_b32
Issue: hipErrorNoBinaryForGpu
This can occur due to incorrect register metadata. The backend now computes this automatically, but if you encounter this:
Ensure you’re using the correct target architecture (gfx942, gfx90a, etc.)
Check that register allocations respect granularity constraints
Verify LDS size doesn’t exceed hardware limits
Issue: NaN Results from MFMA
This typically indicates missing synchronization:
The backend automatically inserts
s_waitcnt lgkmcnt(0)before MFMAEnsure LDS staging is configured correctly with SHARED_MEMORY address space
Verify that workgroup size and constraints are properly configured
Issue: Register Allocation Errors
The backend now dynamically computes register requirements:
Check
amdhsa_next_free_vgprandamdhsa_next_free_sgprin generated assemblyEnsure allocations are aligned to granularity (VGPR: 4, SGPR: 8 or 16)
Verify that workgroup size is properly specified in MLIR
translation_infoattribute
Debugging¶
To debug ASM backend issues:
Enable Assembly Output:
options = WaveCompileOptions( # ... other options ... compile_to_asm=True )
Inspect Generated Assembly: Look for register allocations, instruction sequences, and metadata
Use ROCm Tools:
rocgdb,rocprof, androcm-smifor runtime debuggingCheck MLIR Output: Enable
compile_to_mlir=Trueto see intermediate representation