[BUG] sm_count may be ignored in persistent GEMMs

**Description**
When specifying arguments to a GEMM kernel via `GemmUniversalArguments`, the user may set `sm_count` to reserve some multiprocessors for other concurrent work. However, for persistent GEMM kernels I've found that this field is ignored and all SMs are used regardless of its value. I believe the cause is this conditional branch:

https://github.com/NVIDIA/cutlass/blob/e9627ce55b42fd2599f58cd4396da9380954def0/include/cutlass/gemm/kernel/tile_scheduler_params.h#L1790

The `max_active_clusters` value used there is populated by `cudaOccupancyMaxActiveClusters`, which does not take `sm_count` into account. This patch resolved my issue:
```diff
diff --git a/include/cutlass/gemm/kernel/tile_scheduler_params.h b/include/cutlass/gemm/kernel/tile_scheduler_params.h
index 9ac78311..1c646009 100644
--- a/include/cutlass/gemm/kernel/tile_scheduler_params.h
+++ b/include/cutlass/gemm/kernel/tile_scheduler_params.h
@@ -263,11 +263,13 @@ struct PersistentTileSchedulerSm90Params {
     // In case the maximum number of clusters that could co-exist on the target device is
     // already calculated using cudaOccupancyMaxActiveClusters
     else if (max_active_clusters != 0) {
+      auto max_launchable_clusters = possibly_truncate(max_active_clusters, sm_count / cluster_size);
+
       if (raster_order == RasterOrder::AlongN) {
-        launch_grid.y = max_active_clusters * cluster_shape.n();
+        launch_grid.y = max_launchable_clusters * cluster_shape.n();
       }
       else {
-        launch_grid.x = max_active_clusters * cluster_shape.m();
+        launch_grid.x = max_launchable_clusters * cluster_shape.m();
       }
       CUTLASS_TRACE_HOST("get_grid_shape(): Proposed GridDims by the scheduler using cudaOccupancyMaxActiveClusters = "
           "(" << launch_grid.x << ", " << launch_grid.y << ", " << launch_grid.z << ")\n");
```
However, I am not familiar enough with this code to know whether the patch has unintended side effects.
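To make the intent of the patch concrete, here is a minimal standalone sketch of the clamp, written without any CUTLASS headers. `Grid`, `ClusterShape`, `clamp_clusters_to_sm_budget`, and `persistent_grid` are hypothetical names used only for illustration. The sketch assumes a plain `min` clamp (the real `possibly_truncate` helper may behave differently), assumes the non-rastered grid dimension is 1, and treats `sm_count <= 0` as "no budget given".

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-ins for the CUTLASS types; not the library's definitions.
enum class RasterOrder { AlongM, AlongN };
struct ClusterShape { int m; int n; };
struct Grid { int x; int y; int z; };

// Cap the occupancy-derived cluster count by the caller's SM budget.
// Assumption: sm_count <= 0 means no budget was requested, so nothing is capped.
inline int clamp_clusters_to_sm_budget(int max_active_clusters, int sm_count,
                                       ClusterShape cluster) {
  if (sm_count <= 0) {
    return max_active_clusters;
  }
  int cluster_size = cluster.m * cluster.n;  // CTAs per cluster
  // Integer division rounds down, so a partial cluster is never launched.
  return std::min(max_active_clusters, sm_count / cluster_size);
}

// Build a persistent launch grid from the clamped cluster count.
// Assumption: the dimension that is not rastered along stays at 1.
inline Grid persistent_grid(int max_active_clusters, int sm_count,
                            ClusterShape cluster, RasterOrder order) {
  int clusters = clamp_clusters_to_sm_budget(max_active_clusters, sm_count, cluster);
  if (order == RasterOrder::AlongN) {
    return Grid{cluster.m, clusters * cluster.n, 1};
  }
  return Grid{clusters * cluster.m, cluster.n, 1};
}
```

As a worked case, with 2x1 clusters, `max_active_clusters = 66`, and `sm_count = 100`, the clamp yields 50 clusters and a grid of 100 CTAs, whereas the unclamped path would launch 132.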

**Steps/Code to reproduce bug**
I don't have a minimal reproduction, but this should happen on any persistent GEMM launch that specifies `sm_count` and lets `max_active_clusters` be populated automatically.

**Expected behavior**
Persistent GEMMs launched with a non-default `sm_count` use at most `sm_count` SMs.
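This expectation can be phrased as a device-free check on the launch grid, which might be useful as the basis for a regression test. The sketch below is only an illustration: `LaunchGrid` and `respects_sm_budget` are hypothetical names, and it relies on the observation that a grid with at most `sm_count` CTAs cannot occupy more than `sm_count` SMs. It again assumes `sm_count <= 0` means no budget was requested.

```cpp
#include <cassert>

// Hypothetical mirror of a CUDA dim3; not a CUTLASS or CUDA type.
struct LaunchGrid { unsigned x; unsigned y; unsigned z; };

// Returns true when the grid launches no more CTAs than the SM budget.
// Assumption: sm_count <= 0 means "no budget", so any grid passes.
inline bool respects_sm_budget(LaunchGrid grid, int sm_count) {
  if (sm_count <= 0) {
    return true;
  }
  // Widen before multiplying so large grids cannot overflow.
  unsigned long long ctas = 1ULL * grid.x * grid.y * grid.z;
  return ctas <= static_cast<unsigned long long>(sm_count);
}
```

For example, a grid of (132, 1, 1) fails the check for `sm_count = 100`, while (100, 1, 1) passes.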

**Environment details**
 - CUTLASS commit b78588d1630aa6643bf021613717bafb705df4ef
 - Ubuntu 22.04
 - CUDA Toolkit 12.4
 - H100
