Gather to Shared Memory Optimization

Overview

The gather_to_shared pass enables direct memory loads from global memory to Local Data Store (LDS) without passing through registers, reducing data movement overhead.

This instruction is supported only on specific AMD GPU architectures (gfx94* and gfx95*).

Architecture Support

  • gfx94: Support 32-bit load/store widths

  • gfx95: Support 32-bit, 96-bit, and 128-bit load/store widths

Both architectures also support 8 and 16 bit load widths, but they are zero/sign extended to 32 bit before store, which is not very useful for us.

Instruction Semantics

gather_to_shared is translated to amdgpu.gather_to_lds MLIR op, which is lowered to global_load_lds_* instructions.

Each thread reads 4, 12, or 16 bytes from arbitrary positions in global memory or buffer and writes them contiguously to LDS starting from the address specified in the first thread in wave. Destination addresses in all other threads are ignored.

The operation is asynchronous and AMDGPU backend currently doesn’t enforce any dependencies with other LDS access operations (which may be fixed in the future). Users need to manually insert waitcnt instruction to avoid data races.

This is handled in add_shared_memory_barriers pass. Currently, it will insert waitcnt instruction right before the amdgpu.lds_barrier instruction if it had any preceding amdgpu.gather_to_lds instructions.

Pass Description

gather_to_shared pass works similarly to minimize_global_loads, it takes number of elements that need to be transferred and then divides it by the total number of threads to determine number of elements to be transferred per thread.

Unlike minimize_global_loads it supports a very limited number of elements per thread and only supports a simple contiguous memory layout.

Also, as LDS writes are always contiguous, it doesn’t support padding if the number of elements per wave crosses a row boundary and will undo any LDS padding present in this case.