ASM Backend

The ASM backend is Wave’s direct AMDGCN assembly code generation backend that compiles Wave kernels to native AMD GPU assembly instructions. This backend provides fine-grained control over GPU execution and enables advanced optimizations not possible with higher-level backends.

Overview

The ASM backend transforms Wave kernels through the following pipeline:

  1. MLIR Generation: Wave kernels are first compiled to MLIR (Multi-Level Intermediate Representation)

  2. MLIR Analysis: The MLIR is analyzed to extract kernel information, memory access patterns, and thread organization

  3. Assembly Generation: AMDGCN assembly instructions are generated directly from the analyzed MLIR

  4. Binary Compilation: The assembly is compiled to HSACO (Heterogeneous System Architecture Code Object) binaries using AMD’s toolchain

Architecture

The ASM backend consists of several key components:

The ASM backend follows a modular architecture with clear separation of concerns between MLIR analysis, assembly generation, and register management.

Key Components

MLIR Walker (mlir_walker.py)

Analyzes MLIR operations and extracts kernel information including:

  • Function arguments and their types

  • Memory access patterns (loads/stores)

  • Thread ID operations and bounds

  • Affine expressions and their simplifications

  • Binding subspan operations for memory mapping

ASM Emitter (asm_emitter.py)

Generates AMDGCN assembly instructions from kernel information:

  • Kernel preamble and metadata

  • Register allocation and management

  • Memory access instructions (buffer loads/stores)

  • Thread synchronization and control flow

  • Shader Resource Descriptor (SRD) setup

Instruction Classes (instructions.py)

Provides structured representation of AMDGCN instructions:

  • Base instruction classes for different operation types

  • Specific instruction implementations (loads, stores, arithmetic)

  • Instruction builders for common patterns

  • Proper assembly formatting and syntax

Register Allocator (register_allocator.py)

Manages GPU register allocation:

  • Scalar General Purpose Register (SGPR) allocation

  • Vector General Purpose Register (VGPR) allocation

  • Register conflict detection and resolution

  • Alignment requirements for vector operations

  • Architecture-specific granularities (CDNA2/3: VGPR=4, SGPR=8; RDNA2/3: VGPR=4, SGPR=16)

Expression Emitter (expression_emitter.py)

Generic SymPy expression visitor that emits AMDGCN instructions with CSE:

  • Automatic Common Subexpression Elimination (CSE) with memoization

  • Expression canonicalization to maximize CSE hits (flatten, sort, fold constants)

  • Iterative postorder traversal of expression trees

  • Support for constants, symbols, and complex expressions

  • Optimized instruction selection (shifts for power-of-2, masks for modulo)

  • Const marker system for efficient register usage

  • Handles Add, Mul, Mod, floor division, and Pow operations

  • Structural expression keys for cache lookup

  • Error handling for unsupported expression types

Utils (utils.py)

Provides utility functions for:

  • MLIR type parsing and analysis

  • Affine expression simplification using SymPy

  • Thread ID bound analysis

  • Expression-to-assembly conversion

  • SymPy expression building from MLIR indices

  • Byte offset calculation for memory addressing

Handlers (handlers.py)

Operation-specific handlers for MLIR operations:

  • Memory allocation (memref.alloc) including LDS staging

  • Memory views (memref.view) with offset tracking

  • Load operations (vector.load) from global and LDS memory

  • Store operations (vector.store) to global and LDS memory

  • MFMA operations (amdgpu.mfma) with proper synchronization

  • LDS read/write operations (ds_read_b64, ds_write_b64)

Features

Direct Assembly Generation

The ASM backend generates native AMDGCN assembly instructions, providing:

  • Fine-grained control over GPU execution

  • Optimized memory access patterns with proper SRD setup

  • Efficient register usage through intelligent allocation

  • Thread synchronization with proper wait instructions

Advanced Optimizations

The backend implements several optimization techniques:

  • Common Subexpression Elimination (CSE): Automatically detects and reuses identical expressions across operations

  • Expression Canonicalization: Normalizes expressions (flatten, sort, fold) to maximize CSE effectiveness

  • Affine Expression Simplification: Uses SymPy to simplify complex index expressions

  • Thread ID Analysis: Automatically detects and optimizes thread ID usage patterns

  • Memory Access Optimization: Generates efficient buffer load/store sequences with base+offset addressing

  • Register Lifetime Management: Frees temporary VGPRs promptly while preserving cached expressions

  • Register Reuse: Minimizes register pressure through intelligent allocation and CSE

  • Power-of-2 Optimization: Uses bit shifts instead of multiplication for power-of-2 constants

  • Const Marker System: Avoids unnecessary register allocation for integer constants

Memory Management

The ASM backend handles memory operations through:

  • Shader Resource Descriptors (SRDs): Proper setup for buffer access

  • Vectorized Loads/Stores: Efficient 16-byte aligned memory operations

  • Address Calculation: Optimized offset computation using SymPy expressions

  • LDS (Local Data Share) Staging: Automatic staging through shared memory for improved performance

  • Synchronization: Proper wait instructions for memory consistency (vmcnt, lgkmcnt)

  • Expression-Based Addressing: Dynamic address calculation from MLIR affine maps

Hardware Accelerated Operations

The ASM backend provides native support for AMD GPU specialized instructions:

  • MFMA (Matrix Multiply-Accumulate): Hardware-accelerated matrix operations on CDNA architectures using VGPR-variant instructions

  • LDS Operations: Fast shared memory operations (ds_read_b64, ds_write_b64)

  • Multi-Wave Support: Automatic detection and handling of multi-wave workgroups with proper thread ID extraction

  • Multi-Workgroup Support: Dynamic detection of workgroup ID usage and conditional SGPR allocation

Architecture Support

The ASM backend supports multiple AMD GPU architectures with architecture-specific optimizations:

  • CDNA3 (gfx942): MI300 series with VGPR granularity of 4, SGPR granularity of 8

Dynamic Register Allocation

The backend features fully dynamic register allocation:

  • Automatic VGPR Allocation: Computes required VGPRs based on actual usage

  • Automatic SGPR Allocation: Computes required SGPRs based on actual usage

  • Conditional System Register Allocation: Dynamically detects workgroup ID and thread ID usage from MLIR

  • Granularity Alignment: Automatically rounds allocations to architecture-specific granularities

  • VGPR-Variant MFMA: Uses MFMA instructions that write directly to VGPRs, eliminating accumulator complexity

Usage

Basic Usage

To use the ASM backend, specify it in your Wave compilation options:

import wave_lang.kernel.lang as tkl
import wave_lang.kernel.wave as tkw
from wave_lang.kernel.wave.compile import WaveCompileOptions, wave_compile

# Define your kernel
@tkw.wave(constraints)
def my_kernel(a: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16],
              b: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16]):
    res = tkw.read(a)
    tkw.write(res, b)

# Compile with ASM backend
options = WaveCompileOptions(
    subs={M: 16, N: 16, ADDRESS_SPACE: tkl.AddressSpace.GLOBAL_MEMORY.value},
    backend="asm",
    wave_runtime=True,
    compile_to_mlir=False
)

compiled_kernel = wave_compile(options, my_kernel)

Compilation Options

The ASM backend supports several compilation options:

  • ``backend=”asm”``: Enables the ASM backend

  • ``wave_runtime=True``: Uses Wave’s C++ runtime for execution

  • ``compile_to_mlir=False``: Skips MLIR output, goes directly to assembly

  • ``compile_to_asm=True``: Generates raw assembly text (for debugging)

Example: Simple Copy Kernel

Here’s a complete example of a copy kernel using the ASM backend:

import wave_lang.kernel.lang as tkl
import wave_lang.kernel.wave as tkw
from wave_lang.kernel.wave.compile import WaveCompileOptions, wave_compile
from wave_lang.kernel.wave.utils.torch_utils import device_randn, device_zeros
from torch.testing import assert_close

# Define symbolic dimensions
M = tkl.sym.M
N = tkl.sym.N
ADDRESS_SPACE = tkl.sym.ADDRESS_SPACE

# Hardware constraints
constraints = [
    tkw.HardwareConstraint(
        threads_per_wave=64,
        vector_shapes={M: 16, N: 16}
    ),
    tkw.WorkgroupConstraint(M, 16, 0),
    tkw.WorkgroupConstraint(N, 16, 1),
    tkw.WaveConstraint(M, 16),
    tkw.WaveConstraint(N, 16)
]

# Define the kernel
@tkw.wave(constraints)
def copy_kernel(
    a: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16],
    b: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16]
):
    """Copy kernel that reads from input and writes to output."""
    res = tkw.read(a)
    tkw.write(res, b)

# Create test data
shape = (16, 16)
a = device_randn(shape, dtype=torch.float16)
b = device_zeros(shape, dtype=torch.float16)

# Compile with ASM backend
options = WaveCompileOptions(
    subs={
        M: shape[0],
        N: shape[1],
        ADDRESS_SPACE: tkl.AddressSpace.GLOBAL_MEMORY.value
    },
    canonicalize=True,
    backend="asm",
    wave_runtime=True,
    compile_to_mlir=False
)

# Compile and execute
compiled_kernel = wave_compile(options, copy_kernel)
compiled_kernel(a, b)

# Verify results
assert_close(a, b)
print("Copy kernel executed successfully!")

Generated Assembly

The ASM backend generates optimized AMDGCN assembly. For the copy kernel above, it produces:

.amdgcn_target "amdgcn-amd-amdhsa--gfx942"
.text
.protected copy_kernel
.globl copy_kernel
.p2align 8
.type copy_kernel,@function

.section .rodata,#alloc
.p2align 6
.amdhsa_kernel copy_kernel
  .amdhsa_user_sgpr_kernarg_segment_ptr 1
  .amdhsa_accum_offset 12
  .amdhsa_next_free_vgpr 12
  .amdhsa_next_free_sgpr 16
  .amdhsa_group_segment_fixed_size 0
  .amdhsa_private_segment_fixed_size 0
  .amdhsa_system_sgpr_workgroup_id_x 1
  .amdhsa_system_sgpr_workgroup_id_y 1
  .amdhsa_system_sgpr_workgroup_id_z 1
  .amdhsa_system_vgpr_workitem_id 0
  .amdhsa_float_denorm_mode_32 3
  .amdhsa_float_denorm_mode_16_64 3
.end_amdhsa_kernel
.text

# SRD upper word (gfx9xx): data_format=4 => 0x20000
.set Srd127_96, 0x20000

copy_kernel:
    s_load_dwordx2 s[2:3], s[0:1], 0x0
    s_load_dwordx2 s[4:5], s[0:1], 0x8
    s_waitcnt lgkmcnt(0)

    # SRD for input buffer (arg0)
    s_mov_b32 s8, s2
    s_mov_b32 s9, s3
    s_mov_b32 s10, 2048
    s_mov_b32 s11, Srd127_96

    # Compute lane ID and vector offset
    v_mbcnt_lo_u32_b32 v0, -1, 0
    v_mbcnt_hi_u32_b32 v0, -1, v0
    v_lshlrev_b32 v2, 5, v0

    # Load data from input buffer
    buffer_load_dwordx4 v[4:7], v2, s[8:11], 0 offen offset:0
    s_waitcnt vmcnt(0)
    buffer_load_dwordx4 v[8:11], v2, s[8:11], 0 offen offset:16
    s_waitcnt vmcnt(0)

    # SRD for output buffer (arg1)
    s_mov_b32 s12, s4
    s_mov_b32 s13, s5
    s_mov_b32 s14, 2048
    s_mov_b32 s15, Srd127_96

    # Store data to output buffer
    buffer_store_dwordx4 v[4:7], v2, s[12:15], 0 offen offset:0
    buffer_store_dwordx4 v[8:11], v2, s[12:15], 0 offen offset:16

    s_endpgm

.amdgpu_metadata
---
amdhsa.version:
  - 1
  - 2
amdhsa.kernels:
  - .name: copy_kernel
    .symbol: 'copy_kernel.kd'
    .language: OpenCL C
    .language_version: [2, 0]
    .args:
      - .name: arg0_ptr
        .size: 8
        .offset: 0
        .value_kind: global_buffer
        .value_type: i8*
      - .name: arg1_ptr
        .size: 8
        .offset: 8
        .value_kind: global_buffer
        .value_type: i8*
    .group_segment_fixed_size: 0
    .kernarg_segment_align: 8
    .kernarg_segment_size: 16
    .max_flat_workgroup_size: 64
    .private_segment_fixed_size: 0
    .sgpr_count: 16
    .sgpr_spill_count: 0
    .vgpr_count: 12
    .vgpr_spill_count: 0
    .wavefront_size: 64
...
.end_amdgpu_metadata

Key Features Demonstrated

This example demonstrates several key features of the ASM backend:

  1. Automatic SRD Setup: The backend automatically generates Shader Resource Descriptor setup for buffer access

  2. Efficient Memory Access: Uses 16-byte aligned buffer load/store instructions for optimal performance

  3. Thread ID Management: Automatically computes lane IDs and vector offsets for memory access

  4. Dynamic Register Allocation: Intelligently allocates SGPRs and VGPRs based on actual usage

  5. Synchronization: Proper wait instructions ensure memory consistency

Example: Matrix Multiply with MFMA

The ASM backend supports hardware-accelerated matrix operations using MFMA instructions on CDNA architectures:

import wave_lang.kernel.lang as tkl
import wave_lang.kernel.wave as tkw
from wave_lang.kernel.wave.compile import WaveCompileOptions, wave_compile

# Define symbolic dimensions
M = tkl.sym.M
N = tkl.sym.N
K = tkl.sym.K
BLOCK_M = tkl.sym.BLOCK_M
BLOCK_N = tkl.sym.BLOCK_N
LOAD_ELEMS_PER_THREAD = tkl.sym.LOAD_ELEMS_PER_THREAD
STORE_ELEMS_PER_THREAD = tkl.sym.STORE_ELEMS_PER_THREAD
ADDRESS_SPACE = tkl.sym.ADDRESS_SPACE

# Hardware constraints for MFMA
constraints = [
    tkw.HardwareConstraint(
        threads_per_wave=64,
        waves_per_block=(1, 1, 1),
        vector_shapes={M: BLOCK_M, N: BLOCK_N}
    ),
    tkw.WorkgroupConstraint(M, BLOCK_M, 0),
    tkw.WorkgroupConstraint(N, BLOCK_N, 1),
    tkw.WaveConstraint(M, BLOCK_M / 1),
    tkw.WaveConstraint(N, BLOCK_N / 1)
]

# Define MMA kernel with LDS staging
@tkw.wave(constraints)
def mma_kernel(
    a: tkl.Memory[M, K, ADDRESS_SPACE, tkl.f16],
    b: tkl.Memory[N, K, ADDRESS_SPACE, tkl.f16],
    c: tkl.Memory[M, N, GLOBAL_ADDRESS_SPACE, tkl.f32]
):
    """Matrix multiply kernel using MFMA with LDS staging."""
    # Allocate LDS for staging
    a_reg = tkw.read(a, elements_per_thread=LOAD_ELEMS_PER_THREAD)
    b_reg = tkw.read(b, elements_per_thread=LOAD_ELEMS_PER_THREAD)

    # Perform MFMA operation
    acc = tkw.mma(a_reg, b_reg)

    # Write results
    tkw.write(acc, c, elements_per_thread=STORE_ELEMS_PER_THREAD)

# Compile with ASM backend
options = WaveCompileOptions(
    subs={
        M: 16, N: 16, K: 16,
        BLOCK_M: 16, BLOCK_N: 16,
        LOAD_ELEMS_PER_THREAD: 4,
        STORE_ELEMS_PER_THREAD: 4,
        ADDRESS_SPACE: tkl.AddressSpace.SHARED_MEMORY.value
    },
    canonicalize=True,
    backend="asm",
    wave_runtime=True,
    compile_to_mlir=False
)

compiled_kernel = wave_compile(options, mma_kernel)

MFMA Key Features

The MFMA support includes:

  1. LDS Staging: Automatically stages data through Local Data Share for optimal MFMA performance

  2. VGPR-Variant MFMA: Uses MFMA instructions that write directly to VGPRs (not accumulators)

  3. Synchronization: Inserts s_waitcnt lgkmcnt(0) before MFMA to ensure LDS reads complete

  4. Multi-Wave Support: Correctly handles multiple waves per workgroup with thread ID extraction

  5. Multi-Workgroup Support: Automatically detects and allocates workgroup ID system SGPRs as needed

  6. Dynamic Metadata: Computes vgpr_count, sgpr_count, and lds_size dynamically

Advanced Features

Multi-Wave and Multi-Workgroup Support

The ASM backend automatically handles complex thread and workgroup configurations:

Multi-Wave Kernels

When a workgroup contains multiple waves (e.g., workgroup_size = [256, 4, 1]), the backend:

  1. Detects Multi-Wave Configuration: Analyzes workgroup size from MLIR translation_info

  2. Requests System VGPRs: Sets .amdhsa_system_vgpr_workitem_id 1 to get flat thread ID in v0

  3. Extracts Thread IDs: Generates code to extract tid_x and tid_y from flat ID:

    • tid_x = v0 & 0x3ff (bits 0-9)

    • tid_y = (v0 >> 10) & 0x3ff (bits 10-19)

  4. Uses in Addressing: Thread IDs are used in affine expressions for memory access

Multi-Workgroup Kernels

When a kernel is dispatched across multiple workgroups (e.g., grid = [16, 16, 1]), the backend:

  1. Detects Workgroup ID Usage: Scans MLIR for gpu.block_id operations

  2. Conditionally Requests System SGPRs: Only requests needed workgroup IDs:

    • .amdhsa_system_sgpr_workgroup_id_x 1 if gpu.block_id x is used

    • .amdhsa_system_sgpr_workgroup_id_y 1 if gpu.block_id y is used

    • .amdhsa_system_sgpr_workgroup_id_z 1 if gpu.block_id z is used

  3. Allocates SGPRs: Places workgroup IDs at s2, s3, s4 (after kernarg pointer)

  4. Uses in Addressing: Workgroup IDs scale memory access for workgroup-local tiles

Example: 256x256 MMA with 4 workgroups (2x2 grid), single wave per workgroup:

  • Each workgroup processes a 64x64 tile

  • wgid_x and wgid_y are detected and allocated

  • Global memory addresses: base + (wgid_x * 64 * 4) + (wgid_y * 64 * 256 * 4) + tid_x

Affine Expression Simplification

The ASM backend uses SymPy to simplify complex affine expressions:

# Complex index expression
@tkw.wave(constraints)
def complex_index_kernel(a: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16]):
    # This expression gets simplified automatically
    idx = tkl.affine.apply(lambda d0, s0: d0 - (d0 // 64) * 64, [tkl.tid.x])
    res = tkw.read(a[idx, 0])
    tkw.write(res, b[idx, 0])

The backend automatically simplifies d0 - (d0 // 64) * 64 to just d0 when d0 < 64.

Thread ID Analysis

The backend analyzes thread ID usage patterns:

@tkw.wave(constraints)
def thread_id_kernel(a: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16]):
    # Backend automatically detects this is just tid.x
    idx = tkl.affine.apply(lambda d0: d0, [tkl.tid.x])
    res = tkw.read(a[idx, 0])

The backend recognizes that the affine expression simplifies to the thread ID and generates optimal code.

Expression Visitor System

The ASM backend uses a sophisticated expression visitor (ExprEmitter) to convert SymPy expressions to AMDGCN assembly with automatic Common Subexpression Elimination:

Supported Operations:

  • Constants and Symbols: Direct materialization into registers

  • Addition: Efficient left-to-right accumulation using v_add_u32

  • Multiplication: Power-of-2 uses v_lshlrev_b32 (shift), others use v_mul_lo_u32

  • Modulo: Power-of-2 divisors use v_and_b32 (mask)

  • Floor Division: Power-of-2 divisors use v_lshrrev_b32 (shift)

  • Power of 2: Constant folding in expressions

Optimizations:

  • Common Subexpression Elimination: Automatically caches and reuses emitted expressions

  • Expression Canonicalization: Normalizes expressions to maximize cache hits (flatten Add/Mul, sort operands, fold constants)

  • Structural Expression Keys: Uses structural hashing for efficient cache lookup

  • Const/Dynamic Splitting: Separates constant offsets into instruction immediates

  • Lifetime Management: Frees temporary registers while preserving cached values

  • Const Marker System: Avoids allocating registers for intermediate constants

  • Iterative Postorder Traversal: Efficient expression tree walking

  • Register Reuse: Minimizes temporary register allocation through CSE

  • Instruction Selection: Chooses optimal instructions based on operand types

Example Expression Handling:

# Complex index expression: row*256 + col*4
# where row = tid_x // 16, col = tid_x % 16
# Used multiple times in load/store operations

# The backend automatically:
# 1. Canonicalizes the expression (flatten, sort, fold)
# 2. Simplifies floor division by 16 to right shift by 4
# 3. Simplifies modulo 16 to mask with 15
# 4. Optimizes multiplication by 256 to left shift by 8
# 5. Optimizes multiplication by 4 to left shift by 2
# 6. Caches the result for reuse across multiple operations
# 7. Accumulates results efficiently

# Generated assembly (simplified) - computed once, reused multiple times:
# v_lshrrev_b32 v2, 4, v1      # row = tid_x >> 4 (cached)
# v_lshlrev_b32 v2, 8, v2       # row * 256 (cached)
# v_and_b32 v3, 15, v1          # col = tid_x & 15 (cached)
# v_lshlrev_b32 v3, 2, v3       # col * 4 (cached)
# v_add_u32 v2, v2, v3          # row*256 + col*4 (cached in v2)
#
# # Subsequent uses of the same expression reuse v2:
# buffer_load_dwordx4 v[4:7], v2, s[8:11], 0 offen offset:0
# buffer_store_dwordx4 v[4:7], v2, s[12:15], 0 offen offset:0
# # No duplicate shift/mask instructions emitted!

Performance Considerations

The ASM backend is designed for performance-critical applications:

  • Direct Assembly: Eliminates intermediate compilation steps

  • Common Subexpression Elimination: Eliminates redundant computations by caching and reusing expressions

  • Optimized Instructions: Uses the most efficient AMDGCN instructions (shifts over multiplies, masks over divides)

  • Register Efficiency: Minimizes register pressure through intelligent allocation, CSE, and const marker system

  • Memory Bandwidth: Optimizes memory access patterns with base+offset addressing for maximum throughput

  • Lifetime Management: Frees temporary registers promptly to reduce pressure while preserving cached values

  • Hardware Acceleration: Leverages MFMA for matrix operations on CDNA architectures

  • LDS Staging: Automatically uses shared memory for improved memory access patterns

  • Dynamic Allocation: Computes exact register requirements for minimal resource usage

  • Architecture-Specific: Adapts to hardware granularities for optimal register allocation

VMEM Wait Optimization (Ticket-based vmcnt)

To hide vector memory (VMEM) latency and avoid over-synchronization, the backend uses a ticket-based scheme to place the minimal required s_waitcnt vmcnt(N) right before the first use of loaded data.

  • Each buffer_load_* is assigned a monotonically increasing ticket T.

  • When consuming data from ticket K and the last issued ticket is T, the backend emits s_waitcnt vmcnt(T - K). This allows newer loads to remain in flight while ensuring the data for ticket K is ready.

  • Waits are placed at the first use (e.g., before an LDS write or a compute op), not after the load.

  • The threshold coalesces across nearby uses and is reset when new loads are issued.

Example (conceptual)

# Issue two loads back-to-back (both in-flight)
buffer_load_dwordx4 v[4:7], v2, s[8:11], 0 offen offset:0   # ticket 0
buffer_load_dwordx4 v[8:11], v2, s[8:11], 0 offen offset:16  # ticket 1

# Minimal wait before first use of the data from ticket 0
s_waitcnt vmcnt(1)   # allow one newer load (ticket 1) to remain outstanding
buffer_store_dwordx4 v[4:7], v2, s[12:15], 0 offen offset:0

# Later when using data from ticket 1
s_waitcnt vmcnt(0)
buffer_store_dwordx4 v[8:11], v2, s[12:15], 0 offen offset:16

This placement hides memory latency behind independent address/index computation and reduces the number of waits (and their strictness) compared to always using vmcnt(0) immediately after each load.

Latency-Aware Scheduling (Database-driven)

Beyond VMEM wait placement, the ASM backend employs a database-driven, latency-aware scheduler to minimize stalls across VMEM, LGKM (LDS/scalar), VALU and MFMA pipelines.

  • Latency Database (latency_db/gfx942.json): Versioned, per-architecture JSON containing instruction latencies and throughputs, plus hazard-specific distances (e.g. mfma_to_agpr_read). Each entry has a measurement source: - isa_manual: From AMD ISA documentation - llvm_codegen: From LLVM’s proven codegen patterns - measured / profiled: From microbenchmarks or profiling (future-ready)

  • Latency Provider (latency_provider.py): Query interface used by the emitter/handlers to retrieve latency/throughput and hazard distances. All values come from the JSON database; there are no hardcoded latencies in the code path.

  • Scoreboard (scoreboard.py): Tracks outstanding instructions and their readiness, detects RAW/WAW hazards, and recommends minimal waits/NOPs. It also integrates with the ticket-based VMEM/LGKM model.

  • Always-On Integration (asm_emitter.py): The emitter unconditionally initializes the Latency Provider and Scoreboard. Latency-aware scheduling is always enabled.

MFMA Scheduling Example

The backend uses VGPR-variant MFMA instructions that write results directly to VGPRs, simplifying the instruction sequence and eliminating the need for accumulator transfers. The latency-aware scheduler tracks MFMA execution to ensure proper timing before using results:

# Wait for LDS data before MFMA
s_waitcnt lgkmcnt(0)
# VGPR-variant MFMA writes directly to VGPRs
v_mfma_f32_16x16x16_f16 v[0:3], v[26:27], v[24:25], 0
# Results in v[0:3] are ready after MFMA latency (~8 cycles)
# Scheduler ensures proper timing before using v[0:3]
buffer_store_dwordx4 v[0:3], v2, s[12:15], 0 offen

Benefits

  • Minimal waits and NOPs derived from a single source of truth (the database)

  • Architecture-specific values without code changes

  • VGPR-variant MFMA simplifies instruction sequences and matches LLVM backend behavior

  • Dynamic workgroup/thread ID detection minimizes system register usage

  • Ready to adopt measured data in the future

Best Practices

For optimal performance with the ASM backend:

  1. Use Power-of-2 Dimensions: Enables optimal shift/mask instruction generation

  2. Align Memory Access: 16-byte aligned access patterns maximize bandwidth

  3. Leverage LDS: Use shared memory (ADDRESS_SPACE.SHARED_MEMORY) for frequently accessed data

  4. Enable MFMA: Use matrix operations on CDNA architectures for best performance

  5. Minimize Register Pressure: Keep working sets small to maximize occupancy

  6. Profile and Iterate: Use ROCm profiling tools to identify bottlenecks

Limitations

The ASM backend has some limitations:

  • AMD GPU Only: Only supports AMD GPUs with ROCm

  • Power-of-2 Constraints: Non-power-of-2 modulo and division operations are not supported

  • Expression Complexity: Some very complex affine expressions may not be supported

  • CDNA for MFMA: MFMA operations require CDNA2 or CDNA3 architecture (gfx90a, gfx940, gfx941, gfx942)

  • Dynamic Shapes: Requires concrete shape values at compile time

Troubleshooting

Common Issues and Solutions

Issue: ValueError: Mod divisor must be power-of-two

This occurs when using modulo with non-power-of-2 divisors:

# Bad: modulo by 3 not supported
idx = tid_x % 3  # ERROR

# Good: modulo by power-of-2
idx = tid_x % 16  # OK - uses v_and_b32

Issue: hipErrorNoBinaryForGpu

This can occur due to incorrect register metadata. The backend now computes this automatically, but if you encounter this:

  • Ensure you’re using the correct target architecture (gfx942, gfx90a, etc.)

  • Check that register allocations respect granularity constraints

  • Verify LDS size doesn’t exceed hardware limits

Issue: NaN Results from MFMA

This typically indicates missing synchronization:

  • The backend automatically inserts s_waitcnt lgkmcnt(0) before MFMA

  • Ensure LDS staging is configured correctly with SHARED_MEMORY address space

  • Verify that workgroup size and constraints are properly configured

Issue: Register Allocation Errors

The backend now dynamically computes register requirements:

  • Check amdhsa_next_free_vgpr and amdhsa_next_free_sgpr in generated assembly

  • Ensure allocations are aligned to granularity (VGPR: 4, SGPR: 8 or 16)

  • Verify that workgroup size is properly specified in MLIR translation_info attribute

Debugging

To debug ASM backend issues:

  1. Enable Assembly Output:

    options = WaveCompileOptions(
        # ... other options ...
        compile_to_asm=True
    )
    
  2. Inspect Generated Assembly: Look for register allocations, instruction sequences, and metadata

  3. Use ROCm Tools: rocgdb, rocprof, and rocm-smi for runtime debugging

  4. Check MLIR Output: Enable compile_to_mlir=True to see intermediate representation