Schedule Reordering Module¶
This module implements a ping-pong scheduling algorithm for optimizing wave-level parallelism in GPU kernels. The algorithm enables two waves within the same SIMD group to run in parallel by carefully orchestrating their execution phases and synchronization.
Most of the module and algorithm described below can be found in wave_lang/kernel/wave/schedule_reordering.py
Overview¶
The scheduling algorithm implements a ping-pong pattern where two waves alternate between different computational phases:
Wave Synchronization Phase:
Wave “High” is initially blocked before the loop
Wave “Low” proceeds with memory operations
When Wave “Low” hits a workgroup barrier after reads, Wave “High” is unblocked
Both waves then proceed in parallel with different tasks
Parallel Execution Phases:
First Cluster:
Wave “Low” performs matrix multiplications (MMAs)
Wave “High” performs local and global memory reads
Second Cluster:
Wave “Low” performs local writes
Wave “High” performs matrix multiplications (MMAs)
Synchronization:
Waves synchronize using workgroup barriers
Conditional barriers control wave execution flow
After synchronization, both waves can proceed to the next iteration
Implementation Details¶
Operation Classification¶
The algorithm classifies operations into four main categories:
Global Reads: Memory operations loading from global memory
Local Reads: Memory operations loading from shared memory
Local Writes: Memory operations writing to shared memory
MMAs: Matrix multiplication operations
Graph Reordering¶
The reordering process follows these steps:
Detection:
Identifies operations within a for-loop
Classifies operations into the four categories
Determines operation dependencies
Clustering:
Groups operations into logical clusters
Reorders clusters to optimize wave parallelism
Maintains data dependencies while enabling parallel execution
Graph Reconstruction:
Preserves operations before the cluster
Inserts reordered cluster operations
Maintains operations after the cluster
Ensures correct execution order while enabling parallel wave execution
Scheduling Strategy¶
The module supports different scheduling strategies:
SchedReorderStrategy.NONE: No reordering appliedSchedReorderStrategy.TWO_PP_CLUSTER: Implements the ping-pong pattern with two clusters
The strategy selection is based on:
Hardware constraints
Tile sizes (M, N, K dimensions)
Wave count requirements
Key Functions¶
- schedule_reordering(trace, constraints, scheduling_type)¶
Main entry point for the scheduling algorithm. Processes the trace and applies the appropriate reordering strategy based on the given constraints.
- transform_two_PP_clusters(mma_nodes, local_load_lhs, local_load_rhs, global_load_lhs, global_load_rhs, local_write_lhs, local_write_rhs)¶
Implements the two-cluster ping-pong transformation, creating the necessary operation clusters and synchronization points.
- add_conditional_barriers_to_loop(custom_iterate, trace, hardware_constraint)¶
Adds conditional barriers to control wave execution flow, implementing the wave synchronization mechanism.
- reorder_graph(graph, clusters)¶
Reconstructs the computation graph with the reordered operations while maintaining correct execution order.
Hardware Requirements¶
The algorithm requires specific hardware characteristics:
Even number of waves per block
Compatible tile sizes for M, N, and K dimensions
Support for wave-level synchronization primitives
The current implementation specifically targets configurations with 8 waves per block.
Example Configuration¶
The default configuration for two-cluster ping-pong scheduling:
Block M: 128
Block N: 256
Block K: 64
Waves per block: 8
Notes¶
The algorithm is specifically designed for prefetch scheduling types
Success of the transformation depends on the ability to properly classify and reorder operations
The implementation includes safety checks to ensure correct execution order is maintained
Wave synchronization is critical for correct execution and is handled through a combination of conditional and workgroup barriers