Buffer Loads, Stores, and L1 Cache Swizzling
============================================

This optimization replaces normal vector loads and stores with **AMDGPU buffer loads and stores** using the
`amdgpu.fat_raw_buffer_cast` operation.

Motivation
----------

On AMD GPUs, **buffer loads** are often faster than flat global loads that rely on each thread having a
VGPR-held address. With a buffer load, a **base address** is used and per-thread **offsets** are provided.
These offsets are applied relative to a **buffer resource descriptor**, reducing the overhead of address
computation and allowing the hardware to optimize memory access.

This change improves memory performance by:

- Using buffer loads/stores instead of flat global loads/stores.
- Optionally enabling the **L1 cache swizzle** feature to improve cache set utilization.

L1 Cache Swizzling on MI300
---------------------------

On AMD MI300-class architectures, the L1 cache is divided into **4 cache sets**.
The **cache index** is determined by bits 7 and 8 of the memory address being accessed.

If the access stride is a power of two greater than or equal to ``2^8`` (256 bytes),
all accesses may map to the same cache set (set 0), leading to **cache set conflicts** and reduced performance.

To mitigate this, the hardware provides an **L1 cache swizzle** mechanism:

- A stride value is provided to the hardware.
- The hardware uses additional, changing bits of the address to select the cache set.
- This distributes accesses more evenly across the 4 sets.

Implementation Details
----------------------

This optimization is implemented in `_cast_buffer_and_encode_stride`:

- If the stride of the second-to-last dimension is less than or equal to ``8192`` bytes
  then a **cache swizzle stride** is passed to mlir-op: `amdgpu.fat_raw_buffer_cast`.
- Otherwise, a normal ``fat_raw_buffer_cast`` is used without swizzling.
- In both cases, bounds checking is enabled, offsets are reset, and ``valid_bytes`` is set
  to the maximum byte range addressable from the buffer.

Example
-------

Without swizzling::

    %buf = amdgpu.fat_raw_buffer_cast %ptr
           bounds_check = true
           reset_offset = true
           valid_bytes = %valid_bytes

With swizzling::

    %buf = amdgpu.fat_raw_buffer_cast %ptr
           cache_swizzle_stride = %stride
           bounds_check = true
           reset_offset = true
           valid_bytes = %valid_bytes