Description
When specifying arguments to a GEMM kernel via GemmUniversalArguments, the user may set sm_count to a particular value in order to carve out multiprocessors for other concurrent work. However, for persistent GEMM kernels I've found that this field is ignored, and all SMs are used regardless of the value. I believe this is because of this conditional branch:
given that max_active_clusters is populated by cudaOccupancyMaxActiveClusters and does not take sm_count into account. This patch was able to resolve my issue:
diff --git a/include/cutlass/gemm/kernel/tile_scheduler_params.h b/include/cutlass/gemm/kernel/tile_scheduler_params.h
index 9ac78311..1c646009 100644
--- a/include/cutlass/gemm/kernel/tile_scheduler_params.h
+++ b/include/cutlass/gemm/kernel/tile_scheduler_params.h
@@ -263,11 +263,13 @@ struct PersistentTileSchedulerSm90Params {
     // In case the maximum number of clusters that could co-exist on the target device is
     // already calculated using cudaOccupancyMaxActiveClusters
     else if (max_active_clusters != 0) {
+      auto max_launchable_clusters = possibly_truncate(max_active_clusters, sm_count / cluster_size);
+
       if (raster_order == RasterOrder::AlongN) {
-        launch_grid.y = max_active_clusters * cluster_shape.n();
+        launch_grid.y = max_launchable_clusters * cluster_shape.n();
       }
       else {
-        launch_grid.x = max_active_clusters * cluster_shape.m();
+        launch_grid.x = max_launchable_clusters * cluster_shape.m();
       }
       CUTLASS_TRACE_HOST("get_grid_shape(): Proposed GridDims by the scheduler using cudaOccupancyMaxActiveClusters = "
         "(" << launch_grid.x << ", " << launch_grid.y << ", " << launch_grid.z << ")\n");
but I am not familiar enough with this code to know if it has unintended effects.
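For anyone reviewing the idea without the surrounding scheduler code, here is a minimal standalone sketch of the clamping the patch is aiming for. It is not CUTLASS code: `ClusterShape`, `GridDims`, `Raster`, `max_launchable_clusters`, and `persistent_grid` are hypothetical names made up for illustration, and it uses a plain `std::min` instead of the `possibly_truncate` helper from the patch. It also assumes one CTA per SM and that sm_count has already been resolved to a positive value.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-ins for illustration only; these are not CUTLASS types.
struct ClusterShape { int m; int n; };
struct GridDims { int x; int y; int z; };
enum class Raster { AlongM, AlongN };

// Clusters a persistent launch may occupy: the occupancy limit, further capped
// by the number of whole clusters that fit in sm_count SMs.
// Assumes one CTA per SM and that sm_count is already a positive value.
inline int max_launchable_clusters(int max_active_clusters, int sm_count, ClusterShape cs) {
  const int cluster_size = cs.m * cs.n;
  assert(cluster_size > 0 && sm_count > 0);
  return std::min(max_active_clusters, sm_count / cluster_size);
}

// Grid for a persistent launch: one cluster footprint, replicated along the
// raster dimension by the clamped cluster count.
inline GridDims persistent_grid(int max_active_clusters, int sm_count,
                                ClusterShape cs, Raster order) {
  const int clusters = max_launchable_clusters(max_active_clusters, sm_count, cs);
  GridDims grid{cs.m, cs.n, 1};
  if (order == Raster::AlongN) {
    grid.y = clusters * cs.n;
  } else {
    grid.x = clusters * cs.m;
  }
  return grid;
}
```

As an example, with max_active_clusters = 66, a 2x1 cluster, and sm_count = 100, this sketch yields 50 clusters, i.e. 100 CTAs in total, so the launch stays within the requested SM budget.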
Steps/Code to reproduce bug
I don't have a minimal reproduction example, but this should happen on any persistent GEMM launch that specifies sm_count and auto-populates max_active_clusters.
Expected behavior
Persistent GEMMs launched with a non-default sm_count use at most sm_count SMs.
Environment details
- CUTLASS commit b78588d
- Ubuntu 22.04
- CUDA Toolkit 12.4
- H100