
[BUG] sm_count may be ignored in persistent GEMMs #2108

Open
@milesvant

Description
When specifying arguments to a GEMM kernel via GemmUniversalArguments, the user may set sm_count to carve out multiprocessors for other concurrent work. However, for persistent GEMM kernels I've found that this field is ignored and all SMs are used regardless of the value. I believe this is because of this conditional branch in include/cutlass/gemm/kernel/tile_scheduler_params.h:

else if (max_active_clusters != 0) {

given that max_active_clusters is populated by cudaOccupancyMaxActiveClusters and does not take sm_count into account. The following patch resolved my issue:

diff --git a/include/cutlass/gemm/kernel/tile_scheduler_params.h b/include/cutlass/gemm/kernel/tile_scheduler_params.h
index 9ac78311..1c646009 100644
--- a/include/cutlass/gemm/kernel/tile_scheduler_params.h
+++ b/include/cutlass/gemm/kernel/tile_scheduler_params.h
@@ -263,11 +263,13 @@ struct PersistentTileSchedulerSm90Params {
     // In case the maximum number of clusters that could co-exist on the target device is
     // already calculated using cudaOccupancyMaxActiveClusters
     else if (max_active_clusters != 0) {
+      auto max_launchable_clusters = possibly_truncate(max_active_clusters, sm_count / cluster_size);
+
       if (raster_order == RasterOrder::AlongN) {
-        launch_grid.y = max_active_clusters * cluster_shape.n();
+        launch_grid.y = max_launchable_clusters * cluster_shape.n();
       }
       else {
-        launch_grid.x = max_active_clusters * cluster_shape.m();
+        launch_grid.x = max_launchable_clusters * cluster_shape.m();
       }
       CUTLASS_TRACE_HOST("get_grid_shape(): Proposed GridDims by the scheduler using cudaOccupancyMaxActiveClusters = "
           "(" << launch_grid.x << ", " << launch_grid.y << ", " << launch_grid.z << ")\n");

However, I am not familiar enough with this code to know whether the change has unintended effects.
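
To make the intent of the change concrete, here is a small standalone sketch of the truncation it performs. The helper name launchable_clusters and the example numbers (2x1 clusters, a 132-SM H100, a 64-SM budget) are illustrative assumptions, not CUTLASS code.

#include <algorithm>
#include <cstdio>

// Illustrative helper (not CUTLASS code): cap the cluster count at whatever
// fits inside the requested SM budget, mirroring possibly_truncate in the patch.
int launchable_clusters(int max_active_clusters, int sm_count, int cluster_size) {
  int budget = sm_count / cluster_size;          // whole clusters that fit in sm_count SMs
  return std::min(max_active_clusters, budget);  // never exceed what occupancy allows
}

int main() {
  // Hypothetical numbers: 2x1 clusters, occupancy reports 66 co-resident clusters
  // (the full 132-SM H100), but the caller requested only 64 SMs.
  int max_active_clusters = 66;  // from cudaOccupancyMaxActiveClusters
  int sm_count            = 64;  // user-requested carve-out
  int cluster_size        = 2;
  std::printf("clusters launched: %d\n",
              launchable_clusters(max_active_clusters, sm_count, cluster_size));  // prints 32
  return 0;
}

Without the cap, the scheduler sizes the grid from all 66 clusters (132 SMs) and the sm_count request has no effect.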

Steps/Code to reproduce bug
I don't have a good minimal reproduction, but this should happen on any persistent GEMM launch that specifies sm_count and lets max_active_clusters be auto-populated.
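
As a rough sketch (not a verified reproduction), this is how sm_count is typically passed on the CUTLASS 3.x path, modeled on the Hopper collective-builder examples. Gemm, the problem sizes, pointers, strides, and the workspace are placeholders assumed to be set up elsewhere, and includes are omitted.

// Gemm is assumed to be a persistent CUTLASS 3.x kernel wrapped in
// cutlass::gemm::device::GemmUniversalAdapter, as in the Hopper examples.
cutlass::KernelHardwareInfo hw_info;
hw_info.device_id = 0;
hw_info.sm_count  = 64;   // ask the persistent kernel to use at most 64 SMs

Gemm::Arguments args{
  cutlass::gemm::GemmUniversalMode::kGemm,
  {M, N, K, 1},                                        // problem shape (placeholders)
  {ptr_A, stride_A, ptr_B, stride_B},                  // mainloop arguments (placeholders)
  {{alpha, beta}, ptr_C, stride_C, ptr_D, stride_D},   // epilogue arguments (placeholders)
  hw_info                                              // sm_count reaches the tile scheduler via hw_info
};

Gemm gemm;
size_t workspace_size = Gemm::get_workspace_size(args);
// ... allocate a device `workspace` of workspace_size bytes ...
gemm.initialize(args, workspace);
gemm.run();

With the current code, the grid for such a launch is sized from max_active_clusters, so all SMs end up being used regardless of hw_info.sm_count.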

Expected behavior
Persistent GEMMs launched with a non-default sm_count should use at most sm_count SMs.

Environment details (please complete the following information):

  • CUTLASS commit b78588d
  • Ubuntu 22.04
  • CUDA Toolkit 12.4
  • H100
