Buffer Loads, Stores, and L1 Cache Swizzling

This optimization replaces ordinary vector loads and stores with AMDGPU buffer loads and stores by casting the underlying memory to a buffer resource with the amdgpu.fat_raw_buffer_cast operation.

Motivation

On AMD GPUs, buffer loads are often faster than flat global loads, which require each thread to hold its full address in VGPRs. A buffer load instead takes a base address from a buffer resource descriptor, shared across threads, plus a per-thread offset. Applying offsets relative to the descriptor reduces the overhead of address computation and allows the hardware to optimize memory access.

This change improves memory performance by:

  • Using buffer loads/stores instead of flat global loads/stores.

  • Optionally enabling the L1 cache swizzle feature to improve cache set utilization.

L1 Cache Swizzling on MI300

On AMD MI300-class architectures, the L1 cache is divided into 4 cache sets. The set is selected by bits 7 and 8 of the memory address being accessed.

If the access stride is a power of two greater than or equal to 2^8 (256 bytes), all accesses may map to the same cache set (set 0), leading to cache set conflicts and reduced performance.
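The conflict can be illustrated with a small Python sketch. It assumes the set index is the two-bit field formed by zero-indexed address bits 7 and 8, that is (address >> 7) & 0b11. The helper names cache_set and sets_touched are hypothetical; they only model the description above, not the actual hardware.

```python
def cache_set(address: int) -> int:
    """Return the modeled L1 set index: address bits 7 and 8 (assumed zero-indexed)."""
    return (address >> 7) & 0b11


def sets_touched(base: int, stride_bytes: int, num_accesses: int) -> set[int]:
    """Collect the distinct sets hit by num_accesses accesses spaced stride_bytes apart."""
    return {cache_set(base + i * stride_bytes) for i in range(num_accesses)}


# A 4096-byte (power-of-two) stride never changes bits 7 and 8.
conflicting = sets_touched(base=0, stride_bytes=4096, num_accesses=16)

# A 4224-byte stride (4096 + 128) advances the set index on every access.
spread = sets_touched(base=0, stride_bytes=4224, num_accesses=16)

print(conflicting, spread)
```

In this model the power-of-two stride lands every access in a single set, while the padded stride cycles through all four.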

To mitigate this, the hardware provides an L1 cache swizzle mechanism:

  • A stride value is provided to the hardware.

  • The hardware uses additional, changing bits of the address to select the cache set.

  • This distributes accesses more evenly across the 4 sets.

Implementation Details

This optimization is implemented in _cast_buffer_and_encode_stride:

  • If the stride of the second-to-last dimension is at most 8192 bytes, a cache swizzle stride is passed to the MLIR op amdgpu.fat_raw_buffer_cast.

  • Otherwise, a plain amdgpu.fat_raw_buffer_cast is emitted without a swizzle stride.

  • In both cases, bounds checking is enabled, offsets are reset, and valid_bytes is set to the maximum byte range addressable from the buffer.
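The selection rule above can be sketched in Python. This is an illustration under stated assumptions, not the body of _cast_buffer_and_encode_stride: the function choose_cast_attributes, the constant MAX_SWIZZLE_STRIDE_BYTES, and the dictionary layout are hypothetical, and passing the byte stride through unchanged is also an assumption. Only the 8192-byte threshold and the attribute names come from the description.

```python
# Threshold taken from the description above; the constant name is hypothetical.
MAX_SWIZZLE_STRIDE_BYTES = 8192


def choose_cast_attributes(stride_elements: int, element_bytes: int,
                           valid_bytes: int) -> dict:
    """Model which attributes the cast would carry for a given row stride."""
    # Stride of the second-to-last dimension, converted to bytes.
    stride_bytes = stride_elements * element_bytes

    # Attributes set in both cases.
    attrs = {
        "bounds_check": True,
        "reset_offset": True,
        "valid_bytes": valid_bytes,
    }

    # Swizzling is requested only when the stride is within the threshold.
    if stride_bytes <= MAX_SWIZZLE_STRIDE_BYTES:
        attrs["cache_swizzle_stride"] = stride_bytes
    return attrs


# 2048 two-byte elements per row: 4096 bytes, within the threshold.
small = choose_cast_attributes(stride_elements=2048, element_bytes=2,
                               valid_bytes=1 << 20)

# 8192 two-byte elements per row: 16384 bytes, above the threshold.
large = choose_cast_attributes(stride_elements=8192, element_bytes=2,
                               valid_bytes=1 << 20)
```

Here small carries a cache_swizzle_stride entry and large does not, mirroring the two cases shown in the example below.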

Example

Without swizzling:

%buf = amdgpu.fat_raw_buffer_cast %ptr
       bounds_check = true
       reset_offset = true
       valid_bytes = %valid_bytes

With swizzling:

%buf = amdgpu.fat_raw_buffer_cast %ptr
       cache_swizzle_stride = %stride
       bounds_check = true
       reset_offset = true
       valid_bytes = %valid_bytes