ASM Backend
===========

The ASM backend is Wave's direct AMDGCN assembly code generation backend that compiles Wave kernels to native AMD GPU assembly instructions. This backend provides fine-grained control over GPU execution and enables advanced optimizations not possible with higher-level backends.

Overview
--------

The ASM backend transforms Wave kernels through the following pipeline:

1. **MLIR Generation**: Wave kernels are first compiled to MLIR (Multi-Level Intermediate Representation)
2. **MLIR Analysis**: The MLIR is analyzed to extract kernel information, memory access patterns, and thread organization
3. **Assembly Generation**: AMDGCN assembly instructions are generated directly from the analyzed MLIR
4. **Binary Compilation**: The assembly is compiled to HSACO (Heterogeneous System Architecture Code Object) binaries using AMD's toolchain

Architecture
------------

The ASM backend consists of several key components:

The ASM backend follows a modular architecture with clear separation of concerns between MLIR analysis, assembly generation, and register management.

Key Components
~~~~~~~~~~~~~~

**MLIR Walker** (`mlir_walker.py`)
  Analyzes MLIR operations and extracts kernel information including:

  - Function arguments and their types
  - Memory access patterns (loads/stores)
  - Thread ID operations and bounds
  - Affine expressions and their simplifications
  - Binding subspan operations for memory mapping

**ASM Emitter** (`asm_emitter.py`)
  Generates AMDGCN assembly instructions from kernel information:

  - Kernel preamble and metadata
  - Register allocation and management
  - Memory access instructions (buffer loads/stores)
  - Thread synchronization and control flow
  - Shader Resource Descriptor (SRD) setup

**Instruction Classes** (`instructions.py`)
  Provides structured representation of AMDGCN instructions:

  - Base instruction classes for different operation types
  - Specific instruction implementations (loads, stores, arithmetic)
  - Instruction builders for common patterns
  - Proper assembly formatting and syntax

**Register Allocator** (`register_allocator.py`)
  Manages GPU register allocation:

  - Scalar General Purpose Register (SGPR) allocation
  - Vector General Purpose Register (VGPR) allocation
  - Register conflict detection and resolution
  - Alignment requirements for vector operations
  - Architecture-specific granularities (CDNA2/3: VGPR=4, SGPR=8; RDNA2/3: VGPR=4, SGPR=16)

**Expression Emitter** (`expression_emitter.py`)
  Generic SymPy expression visitor that emits AMDGCN instructions with CSE:

  - Automatic Common Subexpression Elimination (CSE) with memoization
  - Expression canonicalization to maximize CSE hits (flatten, sort, fold constants)
  - Iterative postorder traversal of expression trees
  - Support for constants, symbols, and complex expressions
  - Optimized instruction selection (shifts for power-of-2, masks for modulo)
  - Const marker system for efficient register usage
  - Handles Add, Mul, Mod, floor division, and Pow operations
  - Structural expression keys for cache lookup
  - Error handling for unsupported expression types

**Utils** (`utils.py`)
  Provides utility functions for:

  - MLIR type parsing and analysis
  - Affine expression simplification using SymPy
  - Thread ID bound analysis
  - Expression-to-assembly conversion
  - SymPy expression building from MLIR indices
  - Byte offset calculation for memory addressing

**Handlers** (`handlers.py`)
  Operation-specific handlers for MLIR operations:

  - Memory allocation (memref.alloc) including LDS staging
  - Memory views (memref.view) with offset tracking
  - Load operations (vector.load) from global and LDS memory
  - Store operations (vector.store) to global and LDS memory
  - MFMA operations (amdgpu.mfma) with proper synchronization
  - LDS read/write operations (ds_read_b64, ds_write_b64)

Features
--------

Direct Assembly Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ASM backend generates native AMDGCN assembly instructions, providing:

- **Fine-grained control** over GPU execution
- **Optimized memory access patterns** with proper SRD setup
- **Efficient register usage** through intelligent allocation
- **Thread synchronization** with proper wait instructions

Advanced Optimizations
~~~~~~~~~~~~~~~~~~~~~~

The backend implements several optimization techniques:

- **Common Subexpression Elimination (CSE)**: Automatically detects and reuses identical expressions across operations
- **Expression Canonicalization**: Normalizes expressions (flatten, sort, fold) to maximize CSE effectiveness
- **Affine Expression Simplification**: Uses SymPy to simplify complex index expressions
- **Thread ID Analysis**: Automatically detects and optimizes thread ID usage patterns
- **Memory Access Optimization**: Generates efficient buffer load/store sequences with base+offset addressing
- **Register Lifetime Management**: Frees temporary VGPRs promptly while preserving cached expressions
- **Register Reuse**: Minimizes register pressure through intelligent allocation and CSE
- **Power-of-2 Optimization**: Uses bit shifts instead of multiplication for power-of-2 constants
- **Const Marker System**: Avoids unnecessary register allocation for integer constants

Memory Management
~~~~~~~~~~~~~~~~~

The ASM backend handles memory operations through:

- **Shader Resource Descriptors (SRDs)**: Proper setup for buffer access
- **Vectorized Loads/Stores**: Efficient 16-byte aligned memory operations
- **Address Calculation**: Optimized offset computation using SymPy expressions
- **LDS (Local Data Share) Staging**: Automatic staging through shared memory for improved performance
- **Synchronization**: Proper wait instructions for memory consistency (vmcnt, lgkmcnt)
- **Expression-Based Addressing**: Dynamic address calculation from MLIR affine maps

Hardware Accelerated Operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ASM backend provides native support for AMD GPU specialized instructions:

- **MFMA (Matrix Multiply-Accumulate)**: Hardware-accelerated matrix operations on CDNA architectures using VGPR-variant instructions
- **LDS Operations**: Fast shared memory operations (ds_read_b64, ds_write_b64)
- **Multi-Wave Support**: Automatic detection and handling of multi-wave workgroups with proper thread ID extraction
- **Multi-Workgroup Support**: Dynamic detection of workgroup ID usage and conditional SGPR allocation

Architecture Support
~~~~~~~~~~~~~~~~~~~~

The ASM backend supports multiple AMD GPU architectures with architecture-specific optimizations:

- **CDNA3 (gfx942)**: MI300 series with VGPR granularity of 4, SGPR granularity of 8

Dynamic Register Allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The backend features fully dynamic register allocation:

- **Automatic VGPR Allocation**: Computes required VGPRs based on actual usage
- **Automatic SGPR Allocation**: Computes required SGPRs based on actual usage
- **Conditional System Register Allocation**: Dynamically detects workgroup ID and thread ID usage from MLIR
- **Granularity Alignment**: Automatically rounds allocations to architecture-specific granularities
- **VGPR-Variant MFMA**: Uses MFMA instructions that write directly to VGPRs, eliminating accumulator complexity

Usage
-----

Basic Usage
~~~~~~~~~~~

To use the ASM backend, specify it in your Wave compilation options:

.. code-block:: python

   import wave_lang.kernel.lang as tkl
   import wave_lang.kernel.wave as tkw
   from wave_lang.kernel.wave.compile import WaveCompileOptions, wave_compile

   # Define your kernel
   @tkw.wave(constraints)
   def my_kernel(a: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16],
                 b: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16]):
       res = tkw.read(a)
       tkw.write(res, b)

   # Compile with ASM backend
   options = WaveCompileOptions(
       subs={M: 16, N: 16, ADDRESS_SPACE: tkl.AddressSpace.GLOBAL_MEMORY.value},
       backend="asm",
       wave_runtime=True,
       compile_to_mlir=False
   )

   compiled_kernel = wave_compile(options, my_kernel)

Compilation Options
~~~~~~~~~~~~~~~~~~~

The ASM backend supports several compilation options:

- **``backend="asm"``**: Enables the ASM backend
- **``wave_runtime=True``**: Uses Wave's C++ runtime for execution
- **``compile_to_mlir=False``**: Skips MLIR output, goes directly to assembly
- **``compile_to_asm=True``**: Generates raw assembly text (for debugging)

Example: Simple Copy Kernel
---------------------------

Here's a complete example of a copy kernel using the ASM backend:

.. code-block:: python

   import wave_lang.kernel.lang as tkl
   import wave_lang.kernel.wave as tkw
   from wave_lang.kernel.wave.compile import WaveCompileOptions, wave_compile
   from wave_lang.kernel.wave.utils.torch_utils import device_randn, device_zeros
   from torch.testing import assert_close

   # Define symbolic dimensions
   M = tkl.sym.M
   N = tkl.sym.N
   ADDRESS_SPACE = tkl.sym.ADDRESS_SPACE

   # Hardware constraints
   constraints = [
       tkw.HardwareConstraint(
           threads_per_wave=64,
           vector_shapes={M: 16, N: 16}
       ),
       tkw.WorkgroupConstraint(M, 16, 0),
       tkw.WorkgroupConstraint(N, 16, 1),
       tkw.WaveConstraint(M, 16),
       tkw.WaveConstraint(N, 16)
   ]

   # Define the kernel
   @tkw.wave(constraints)
   def copy_kernel(
       a: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16],
       b: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16]
   ):
       """Copy kernel that reads from input and writes to output."""
       res = tkw.read(a)
       tkw.write(res, b)

   # Create test data
   shape = (16, 16)
   a = device_randn(shape, dtype=torch.float16)
   b = device_zeros(shape, dtype=torch.float16)

   # Compile with ASM backend
   options = WaveCompileOptions(
       subs={
           M: shape[0],
           N: shape[1],
           ADDRESS_SPACE: tkl.AddressSpace.GLOBAL_MEMORY.value
       },
       canonicalize=True,
       backend="asm",
       wave_runtime=True,
       compile_to_mlir=False
   )

   # Compile and execute
   compiled_kernel = wave_compile(options, copy_kernel)
   compiled_kernel(a, b)

   # Verify results
   assert_close(a, b)
   print("Copy kernel executed successfully!")

Generated Assembly
~~~~~~~~~~~~~~~~~~

The ASM backend generates optimized AMDGCN assembly. For the copy kernel above, it produces:

.. code-block:: asm

   .amdgcn_target "amdgcn-amd-amdhsa--gfx942"
   .text
   .protected copy_kernel
   .globl copy_kernel
   .p2align 8
   .type copy_kernel,@function

   .section .rodata,#alloc
   .p2align 6
   .amdhsa_kernel copy_kernel
     .amdhsa_user_sgpr_kernarg_segment_ptr 1
     .amdhsa_accum_offset 12
     .amdhsa_next_free_vgpr 12
     .amdhsa_next_free_sgpr 16
     .amdhsa_group_segment_fixed_size 0
     .amdhsa_private_segment_fixed_size 0
     .amdhsa_system_sgpr_workgroup_id_x 1
     .amdhsa_system_sgpr_workgroup_id_y 1
     .amdhsa_system_sgpr_workgroup_id_z 1
     .amdhsa_system_vgpr_workitem_id 0
     .amdhsa_float_denorm_mode_32 3
     .amdhsa_float_denorm_mode_16_64 3
   .end_amdhsa_kernel
   .text

   # SRD upper word (gfx9xx): data_format=4 => 0x20000
   .set Srd127_96, 0x20000

   copy_kernel:
       s_load_dwordx2 s[2:3], s[0:1], 0x0
       s_load_dwordx2 s[4:5], s[0:1], 0x8
       s_waitcnt lgkmcnt(0)

       # SRD for input buffer (arg0)
       s_mov_b32 s8, s2
       s_mov_b32 s9, s3
       s_mov_b32 s10, 2048
       s_mov_b32 s11, Srd127_96

       # Compute lane ID and vector offset
       v_mbcnt_lo_u32_b32 v0, -1, 0
       v_mbcnt_hi_u32_b32 v0, -1, v0
       v_lshlrev_b32 v2, 5, v0

       # Load data from input buffer
       buffer_load_dwordx4 v[4:7], v2, s[8:11], 0 offen offset:0
       s_waitcnt vmcnt(0)
       buffer_load_dwordx4 v[8:11], v2, s[8:11], 0 offen offset:16
       s_waitcnt vmcnt(0)

       # SRD for output buffer (arg1)
       s_mov_b32 s12, s4
       s_mov_b32 s13, s5
       s_mov_b32 s14, 2048
       s_mov_b32 s15, Srd127_96

       # Store data to output buffer
       buffer_store_dwordx4 v[4:7], v2, s[12:15], 0 offen offset:0
       buffer_store_dwordx4 v[8:11], v2, s[12:15], 0 offen offset:16

       s_endpgm

   .amdgpu_metadata
   ---
   amdhsa.version:
     - 1
     - 2
   amdhsa.kernels:
     - .name: copy_kernel
       .symbol: 'copy_kernel.kd'
       .language: OpenCL C
       .language_version: [2, 0]
       .args:
         - .name: arg0_ptr
           .size: 8
           .offset: 0
           .value_kind: global_buffer
           .value_type: i8*
         - .name: arg1_ptr
           .size: 8
           .offset: 8
           .value_kind: global_buffer
           .value_type: i8*
       .group_segment_fixed_size: 0
       .kernarg_segment_align: 8
       .kernarg_segment_size: 16
       .max_flat_workgroup_size: 64
       .private_segment_fixed_size: 0
       .sgpr_count: 16
       .sgpr_spill_count: 0
       .vgpr_count: 12
       .vgpr_spill_count: 0
       .wavefront_size: 64
   ...
   .end_amdgpu_metadata

Key Features Demonstrated
~~~~~~~~~~~~~~~~~~~~~~~~~

This example demonstrates several key features of the ASM backend:

1. **Automatic SRD Setup**: The backend automatically generates Shader Resource Descriptor setup for buffer access
2. **Efficient Memory Access**: Uses 16-byte aligned buffer load/store instructions for optimal performance
3. **Thread ID Management**: Automatically computes lane IDs and vector offsets for memory access
4. **Dynamic Register Allocation**: Intelligently allocates SGPRs and VGPRs based on actual usage
5. **Synchronization**: Proper wait instructions ensure memory consistency

Example: Matrix Multiply with MFMA
-----------------------------------

The ASM backend supports hardware-accelerated matrix operations using MFMA instructions on CDNA architectures:

.. code-block:: python

   import wave_lang.kernel.lang as tkl
   import wave_lang.kernel.wave as tkw
   from wave_lang.kernel.wave.compile import WaveCompileOptions, wave_compile

   # Define symbolic dimensions
   M = tkl.sym.M
   N = tkl.sym.N
   K = tkl.sym.K
   BLOCK_M = tkl.sym.BLOCK_M
   BLOCK_N = tkl.sym.BLOCK_N
   LOAD_ELEMS_PER_THREAD = tkl.sym.LOAD_ELEMS_PER_THREAD
   STORE_ELEMS_PER_THREAD = tkl.sym.STORE_ELEMS_PER_THREAD
   ADDRESS_SPACE = tkl.sym.ADDRESS_SPACE

   # Hardware constraints for MFMA
   constraints = [
       tkw.HardwareConstraint(
           threads_per_wave=64,
           waves_per_block=(1, 1, 1),
           vector_shapes={M: BLOCK_M, N: BLOCK_N}
       ),
       tkw.WorkgroupConstraint(M, BLOCK_M, 0),
       tkw.WorkgroupConstraint(N, BLOCK_N, 1),
       tkw.WaveConstraint(M, BLOCK_M / 1),
       tkw.WaveConstraint(N, BLOCK_N / 1)
   ]

   # Define MMA kernel with LDS staging
   @tkw.wave(constraints)
   def mma_kernel(
       a: tkl.Memory[M, K, ADDRESS_SPACE, tkl.f16],
       b: tkl.Memory[N, K, ADDRESS_SPACE, tkl.f16],
       c: tkl.Memory[M, N, GLOBAL_ADDRESS_SPACE, tkl.f32]
   ):
       """Matrix multiply kernel using MFMA with LDS staging."""
       # Allocate LDS for staging
       a_reg = tkw.read(a, elements_per_thread=LOAD_ELEMS_PER_THREAD)
       b_reg = tkw.read(b, elements_per_thread=LOAD_ELEMS_PER_THREAD)

       # Perform MFMA operation
       acc = tkw.mma(a_reg, b_reg)

       # Write results
       tkw.write(acc, c, elements_per_thread=STORE_ELEMS_PER_THREAD)

   # Compile with ASM backend
   options = WaveCompileOptions(
       subs={
           M: 16, N: 16, K: 16,
           BLOCK_M: 16, BLOCK_N: 16,
           LOAD_ELEMS_PER_THREAD: 4,
           STORE_ELEMS_PER_THREAD: 4,
           ADDRESS_SPACE: tkl.AddressSpace.SHARED_MEMORY.value
       },
       canonicalize=True,
       backend="asm",
       wave_runtime=True,
       compile_to_mlir=False
   )

   compiled_kernel = wave_compile(options, mma_kernel)

MFMA Key Features
~~~~~~~~~~~~~~~~~

The MFMA support includes:

1. **LDS Staging**: Automatically stages data through Local Data Share for optimal MFMA performance
2. **VGPR-Variant MFMA**: Uses MFMA instructions that write directly to VGPRs (not accumulators)
3. **Synchronization**: Inserts ``s_waitcnt lgkmcnt(0)`` before MFMA to ensure LDS reads complete
4. **Multi-Wave Support**: Correctly handles multiple waves per workgroup with thread ID extraction
5. **Multi-Workgroup Support**: Automatically detects and allocates workgroup ID system SGPRs as needed
6. **Dynamic Metadata**: Computes ``vgpr_count``, ``sgpr_count``, and ``lds_size`` dynamically

Advanced Features
-----------------

Multi-Wave and Multi-Workgroup Support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ASM backend automatically handles complex thread and workgroup configurations:

**Multi-Wave Kernels**

When a workgroup contains multiple waves (e.g., ``workgroup_size = [256, 4, 1]``), the backend:

1. **Detects Multi-Wave Configuration**: Analyzes workgroup size from MLIR ``translation_info``
2. **Requests System VGPRs**: Sets ``.amdhsa_system_vgpr_workitem_id 1`` to get flat thread ID in ``v0``
3. **Extracts Thread IDs**: Generates code to extract ``tid_x`` and ``tid_y`` from flat ID:

   - ``tid_x = v0 & 0x3ff`` (bits 0-9)
   - ``tid_y = (v0 >> 10) & 0x3ff`` (bits 10-19)

4. **Uses in Addressing**: Thread IDs are used in affine expressions for memory access

**Multi-Workgroup Kernels**

When a kernel is dispatched across multiple workgroups (e.g., ``grid = [16, 16, 1]``), the backend:

1. **Detects Workgroup ID Usage**: Scans MLIR for ``gpu.block_id`` operations
2. **Conditionally Requests System SGPRs**: Only requests needed workgroup IDs:

   - ``.amdhsa_system_sgpr_workgroup_id_x 1`` if ``gpu.block_id x`` is used
   - ``.amdhsa_system_sgpr_workgroup_id_y 1`` if ``gpu.block_id y`` is used
   - ``.amdhsa_system_sgpr_workgroup_id_z 1`` if ``gpu.block_id z`` is used

3. **Allocates SGPRs**: Places workgroup IDs at ``s2``, ``s3``, ``s4`` (after kernarg pointer)
4. **Uses in Addressing**: Workgroup IDs scale memory access for workgroup-local tiles

**Example**: 256x256 MMA with 4 workgroups (2x2 grid), single wave per workgroup:

- Each workgroup processes a 64x64 tile
- ``wgid_x`` and ``wgid_y`` are detected and allocated
- Global memory addresses: ``base + (wgid_x * 64 * 4) + (wgid_y * 64 * 256 * 4) + tid_x``

Affine Expression Simplification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ASM backend uses SymPy to simplify complex affine expressions:

.. code-block:: python

   # Complex index expression
   @tkw.wave(constraints)
   def complex_index_kernel(a: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16]):
       # This expression gets simplified automatically
       idx = tkl.affine.apply(lambda d0, s0: d0 - (d0 // 64) * 64, [tkl.tid.x])
       res = tkw.read(a[idx, 0])
       tkw.write(res, b[idx, 0])

The backend automatically simplifies ``d0 - (d0 // 64) * 64`` to just ``d0`` when ``d0 < 64``.

Thread ID Analysis
~~~~~~~~~~~~~~~~~~

The backend analyzes thread ID usage patterns:

.. code-block:: python

   @tkw.wave(constraints)
   def thread_id_kernel(a: tkl.Memory[M, N, ADDRESS_SPACE, tkl.f16]):
       # Backend automatically detects this is just tid.x
       idx = tkl.affine.apply(lambda d0: d0, [tkl.tid.x])
       res = tkw.read(a[idx, 0])

The backend recognizes that the affine expression simplifies to the thread ID and generates optimal code.

Expression Visitor System
~~~~~~~~~~~~~~~~~~~~~~~~~

The ASM backend uses a sophisticated expression visitor (``ExprEmitter``) to convert SymPy expressions to AMDGCN assembly with automatic Common Subexpression Elimination:

**Supported Operations:**

- **Constants and Symbols**: Direct materialization into registers
- **Addition**: Efficient left-to-right accumulation using ``v_add_u32``
- **Multiplication**: Power-of-2 uses ``v_lshlrev_b32`` (shift), others use ``v_mul_lo_u32``
- **Modulo**: Power-of-2 divisors use ``v_and_b32`` (mask)
- **Floor Division**: Power-of-2 divisors use ``v_lshrrev_b32`` (shift)
- **Power of 2**: Constant folding in expressions

**Optimizations:**

- **Common Subexpression Elimination**: Automatically caches and reuses emitted expressions
- **Expression Canonicalization**: Normalizes expressions to maximize cache hits (flatten Add/Mul, sort operands, fold constants)
- **Structural Expression Keys**: Uses structural hashing for efficient cache lookup
- **Const/Dynamic Splitting**: Separates constant offsets into instruction immediates
- **Lifetime Management**: Frees temporary registers while preserving cached values
- **Const Marker System**: Avoids allocating registers for intermediate constants
- **Iterative Postorder Traversal**: Efficient expression tree walking
- **Register Reuse**: Minimizes temporary register allocation through CSE
- **Instruction Selection**: Chooses optimal instructions based on operand types

**Example Expression Handling:**

.. code-block:: python

   # Complex index expression: row*256 + col*4
   # where row = tid_x // 16, col = tid_x % 16
   # Used multiple times in load/store operations

   # The backend automatically:
   # 1. Canonicalizes the expression (flatten, sort, fold)
   # 2. Simplifies floor division by 16 to right shift by 4
   # 3. Simplifies modulo 16 to mask with 15
   # 4. Optimizes multiplication by 256 to left shift by 8
   # 5. Optimizes multiplication by 4 to left shift by 2
   # 6. Caches the result for reuse across multiple operations
   # 7. Accumulates results efficiently

   # Generated assembly (simplified) - computed once, reused multiple times:
   # v_lshrrev_b32 v2, 4, v1      # row = tid_x >> 4 (cached)
   # v_lshlrev_b32 v2, 8, v2       # row * 256 (cached)
   # v_and_b32 v3, 15, v1          # col = tid_x & 15 (cached)
   # v_lshlrev_b32 v3, 2, v3       # col * 4 (cached)
   # v_add_u32 v2, v2, v3          # row*256 + col*4 (cached in v2)
   #
   # # Subsequent uses of the same expression reuse v2:
   # buffer_load_dwordx4 v[4:7], v2, s[8:11], 0 offen offset:0
   # buffer_store_dwordx4 v[4:7], v2, s[12:15], 0 offen offset:0
   # # No duplicate shift/mask instructions emitted!

Performance Considerations
----------------------------

The ASM backend is designed for performance-critical applications:

- **Direct Assembly**: Eliminates intermediate compilation steps
- **Common Subexpression Elimination**: Eliminates redundant computations by caching and reusing expressions
- **Optimized Instructions**: Uses the most efficient AMDGCN instructions (shifts over multiplies, masks over divides)
- **Register Efficiency**: Minimizes register pressure through intelligent allocation, CSE, and const marker system
- **Memory Bandwidth**: Optimizes memory access patterns with base+offset addressing for maximum throughput
- **Lifetime Management**: Frees temporary registers promptly to reduce pressure while preserving cached values
- **Hardware Acceleration**: Leverages MFMA for matrix operations on CDNA architectures
- **LDS Staging**: Automatically uses shared memory for improved memory access patterns
- **Dynamic Allocation**: Computes exact register requirements for minimal resource usage
- **Architecture-Specific**: Adapts to hardware granularities for optimal register allocation

VMEM Wait Optimization (Ticket-based vmcnt)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To hide vector memory (VMEM) latency and avoid over-synchronization, the backend uses a
ticket-based scheme to place the minimal required ``s_waitcnt vmcnt(N)`` right before
the first use of loaded data.

- Each ``buffer_load_*`` is assigned a monotonically increasing ticket ``T``.
- When consuming data from ticket ``K`` and the last issued ticket is ``T``, the backend emits
  ``s_waitcnt vmcnt(T - K)``. This allows newer loads to remain in flight while ensuring
  the data for ticket ``K`` is ready.
- Waits are placed at the first use (e.g., before an LDS write or a compute op), not after the load.
- The threshold coalesces across nearby uses and is reset when new loads are issued.

Example (conceptual)
^^^^^^^^^^^^^^^^^^^^

.. code-block:: asm

   # Issue two loads back-to-back (both in-flight)
   buffer_load_dwordx4 v[4:7], v2, s[8:11], 0 offen offset:0   # ticket 0
   buffer_load_dwordx4 v[8:11], v2, s[8:11], 0 offen offset:16  # ticket 1

   # Minimal wait before first use of the data from ticket 0
   s_waitcnt vmcnt(1)   # allow one newer load (ticket 1) to remain outstanding
   buffer_store_dwordx4 v[4:7], v2, s[12:15], 0 offen offset:0

   # Later when using data from ticket 1
   s_waitcnt vmcnt(0)
   buffer_store_dwordx4 v[8:11], v2, s[12:15], 0 offen offset:16

This placement hides memory latency behind independent address/index computation
and reduces the number of waits (and their strictness) compared to always using
``vmcnt(0)`` immediately after each load.

Latency-Aware Scheduling (Database-driven)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Beyond VMEM wait placement, the ASM backend employs a database-driven, latency-aware
scheduler to minimize stalls across VMEM, LGKM (LDS/scalar), VALU and MFMA pipelines.

- **Latency Database** (``latency_db/gfx942.json``): Versioned, per-architecture JSON
  containing instruction latencies and throughputs, plus hazard-specific distances
  (e.g. ``mfma_to_agpr_read``). Each entry has a measurement source:
  - ``isa_manual``: From AMD ISA documentation
  - ``llvm_codegen``: From LLVM's proven codegen patterns
  - ``measured`` / ``profiled``: From microbenchmarks or profiling (future-ready)

- **Latency Provider** (``latency_provider.py``): Query interface used by the emitter/handlers
  to retrieve latency/throughput and hazard distances. All values come from the JSON database;
  there are no hardcoded latencies in the code path.

- **Scoreboard** (``scoreboard.py``): Tracks outstanding instructions and their readiness,
  detects RAW/WAW hazards, and recommends minimal waits/NOPs. It also integrates with the
  ticket-based VMEM/LGKM model.

- **Always-On Integration** (``asm_emitter.py``): The emitter unconditionally initializes
  the Latency Provider and Scoreboard. Latency-aware scheduling is always enabled.

MFMA Scheduling Example
^^^^^^^^^^^^^^^^^^^^^^^

The backend uses VGPR-variant MFMA instructions that write results directly to VGPRs,
simplifying the instruction sequence and eliminating the need for accumulator transfers.
The latency-aware scheduler tracks MFMA execution to ensure proper timing before using results:

.. code-block:: asm

   # Wait for LDS data before MFMA
   s_waitcnt lgkmcnt(0)
   # VGPR-variant MFMA writes directly to VGPRs
   v_mfma_f32_16x16x16_f16 v[0:3], v[26:27], v[24:25], 0
   # Results in v[0:3] are ready after MFMA latency (~8 cycles)
   # Scheduler ensures proper timing before using v[0:3]
   buffer_store_dwordx4 v[0:3], v2, s[12:15], 0 offen

Benefits
^^^^^^^^

- Minimal waits and NOPs derived from a single source of truth (the database)
- Architecture-specific values without code changes
- VGPR-variant MFMA simplifies instruction sequences and matches LLVM backend behavior
- Dynamic workgroup/thread ID detection minimizes system register usage
- Ready to adopt measured data in the future

Best Practices
~~~~~~~~~~~~~~

For optimal performance with the ASM backend:

1. **Use Power-of-2 Dimensions**: Enables optimal shift/mask instruction generation
2. **Align Memory Access**: 16-byte aligned access patterns maximize bandwidth
3. **Leverage LDS**: Use shared memory (ADDRESS_SPACE.SHARED_MEMORY) for frequently accessed data
4. **Enable MFMA**: Use matrix operations on CDNA architectures for best performance
5. **Minimize Register Pressure**: Keep working sets small to maximize occupancy
6. **Profile and Iterate**: Use ROCm profiling tools to identify bottlenecks

Limitations
-----------

The ASM backend has some limitations:

- **AMD GPU Only**: Only supports AMD GPUs with ROCm
- **Power-of-2 Constraints**: Non-power-of-2 modulo and division operations are not supported
- **Expression Complexity**: Some very complex affine expressions may not be supported
- **CDNA for MFMA**: MFMA operations require CDNA2 or CDNA3 architecture (gfx90a, gfx940, gfx941, gfx942)
- **Dynamic Shapes**: Requires concrete shape values at compile time

Troubleshooting
---------------

Common Issues and Solutions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Issue: ValueError: Mod divisor must be power-of-two**

This occurs when using modulo with non-power-of-2 divisors:

.. code-block:: python

   # Bad: modulo by 3 not supported
   idx = tid_x % 3  # ERROR

   # Good: modulo by power-of-2
   idx = tid_x % 16  # OK - uses v_and_b32

**Issue: hipErrorNoBinaryForGpu**

This can occur due to incorrect register metadata. The backend now computes this automatically, but if you encounter this:

- Ensure you're using the correct target architecture (gfx942, gfx90a, etc.)
- Check that register allocations respect granularity constraints
- Verify LDS size doesn't exceed hardware limits

**Issue: NaN Results from MFMA**

This typically indicates missing synchronization:

- The backend automatically inserts ``s_waitcnt lgkmcnt(0)`` before MFMA
- Ensure LDS staging is configured correctly with SHARED_MEMORY address space
- Verify that workgroup size and constraints are properly configured

**Issue: Register Allocation Errors**

The backend now dynamically computes register requirements:

- Check ``amdhsa_next_free_vgpr`` and ``amdhsa_next_free_sgpr`` in generated assembly
- Ensure allocations are aligned to granularity (VGPR: 4, SGPR: 8 or 16)
- Verify that workgroup size is properly specified in MLIR ``translation_info`` attribute

Debugging
~~~~~~~~~

To debug ASM backend issues:

1. **Enable Assembly Output**:

   .. code-block:: python

      options = WaveCompileOptions(
          # ... other options ...
          compile_to_asm=True
      )

2. **Inspect Generated Assembly**: Look for register allocations, instruction sequences, and metadata

3. **Use ROCm Tools**: ``rocgdb``, ``rocprof``, and ``rocm-smi`` for runtime debugging

4. **Check MLIR Output**: Enable ``compile_to_mlir=True`` to see intermediate representation