
Releases: NVIDIA/cutlass

CUTLASS 3.2

28 Aug 00:50
3a8f57a
  • New warp-specialized persistent FP8 GEMM kernel schedules and mainloops targeting the Hopper architecture that achieve high performance with TMA, WGMMA, and threadblock clusters, along with an example showcasing Hopper warp-specialized FP8 GEMMs.
  • New Epilogue Visitor Tree (EVT) support for Hopper TMA epilogues. EVTs allow user-defined, customized epilogue fusion patterns without having to write a new epilogue.
  • Stream-K feature for Hopper. Note that this is only a functional implementation of stream-K, and should not be used for performance comparison. Optimizations are expected in a future release.
  • Improved CTA rasterization and support for CTA swizzling for Hopper kernels using the Tile Scheduler (a conceptual sketch of swizzled rasterization follows this list).
  • Improved performance for warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
  • Hopper GEMM+Permute, an example of fusing tensor reordering (permutation) with the GEMM mainloop or epilogue.
  • New CUTLASS 2D Convolution Python interface, with a new example.
  • Support for Windows (MSVC) builds.
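
For intuition on the rasterization item above, here is a conceptual sketch of swizzled CTA rasterization. It illustrates the general idea only and is not CUTLASS's Tile Scheduler; the function, its parameters, and the group width are made up for this example.

```python
# Conceptual sketch of swizzled CTA rasterization (illustration only, not
# CUTLASS's Tile Scheduler). Instead of walking output tiles row by row over
# the full N extent, tiles are visited in narrow groups of columns so that
# CTAs launched close together reuse the same operand tiles from L2.

def swizzled_tile_coord(linear_idx, tiles_m, tiles_n, group_width=8):
    """Map a linear CTA/tile index to an (m, n) output-tile coordinate."""
    tiles_per_group = tiles_m * group_width
    group = linear_idx // tiles_per_group
    idx_in_group = linear_idx % tiles_per_group
    n0 = group * group_width                   # first column of this group
    width = min(group_width, tiles_n - n0)     # last group may be narrower
    return idx_in_group // width, n0 + idx_in_group % width

# Example: visit a 4 x 16 grid of output tiles in column groups of width 8.
order = [swizzled_tile_coord(i, 4, 16) for i in range(4 * 16)]
print(order[:12])
```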

CUTLASS 3.1

24 May 20:10
6f47420
  • New CUTLASS Python interface that aims to provide an ease-of-use interface for instantiating, emitting, compiling, and running CUTLASS kernels via Python, together with new examples (see the sketch after this list).
  • New efficient epilogues using TMA for Hopper.
  • Support for fused epilogues, such as Bias, ReLU, and GELU, using the new efficient epilogues.
  • New warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
  • New warp-specialized persistent cooperative kernel design that allows for larger tile sizes and improves performance on Hopper.
  • An example showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper.
  • Epilogue builders. Similar to mainloop builders (see example 49), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization.
  • Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler.
  • Performance optimizations for the warp-specialized persistent ping-pong kernel.
  • Changes to the 3.x GEMM API, involving the host-facing arguments and the underlying Params structs.
  • FMHA Backward Pass from Meta xFormers.
  • Stream-K GEMM with Broadcast enables epilogue broadcast with Stream-K GEMM.
  • Batched B2B GEMM can now run multiple Back-to-Back GEMMs with the same problem size in parallel.
  • Batched Strided GEMV supports both row-major and column-major input matrices.
  • Permute + GEMM fusion can now fuse a Permute with the following GEMM. Previously, only fusing GEMM with a Permute in the epilogue was supported.
  • Row Broadcast can be fused in the epilogue.
  • The GitHub branch is renamed from master to main in this release.
  • Optimal performance using CUDA 12.1
  • Updates and bugfixes from the community (thanks!)
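
As a companion to the Python interface item above, here is a minimal sketch of declaring and running a GEMM from Python. It follows the usage pattern shown in the release's Python examples; the exact names (cutlass.op.Gemm, cutlass.LayoutType, the run arguments) are assumptions here and may differ between versions.

```python
# A minimal sketch of driving a GEMM through the new Python interface.
# Module and attribute names (cutlass.op.Gemm, cutlass.LayoutType, plan.run)
# follow the release's Python examples but are assumptions here and may
# differ by version.
import numpy as np
import cutlass

M, N, K = 256, 256, 128
A = np.random.rand(M, K).astype(np.float16)
B = np.random.rand(K, N).astype(np.float16)
C = np.zeros((M, N), dtype=np.float16)
D = np.zeros((M, N), dtype=np.float16)

# Declare a plan; the interface selects, emits, and JIT-compiles a kernel.
plan = cutlass.op.Gemm(element=np.float16, layout=cutlass.LayoutType.RowMajor)

# Computes D = alpha * (A @ B) + beta * C on the GPU.
plan.run(A, B, C, D, alpha=1.0, beta=0.0)
```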

CUTLASS 3.0

10 Mar 04:19
c4f6b8c

3.0.0 (2023-01-23)

CUTLASS 2.11

20 Jan 21:35
66d9cdd

2.11.0 (2022-11-19)

  • Stream-K, which is a new, more general way to do split-K. It can not only improve performance but also significantly reduce the number of tile sizes that need to be profiled to find the best one (a conceptual sketch of the decomposition follows this list).

  • Fused multi-head attention kernel. It has two variants: one uses batched GEMM for fixed sequence lengths, and the other uses grouped GEMM for variable sequence lengths. Both versions need only one kernel.

  • Dual GEMM, which can fuse A x B and A x C into one kernel. The two GEMMs have no producer-consumer dependency.

  • Hopper improves double-precision matrix multiplication by 2x compared to Ampere at iso-clocks. It has been supported since CUDA 11.8.

  • BLAS3 functions with Hopper's new double-precision matrix multiplication instructions.

  • ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of the A matrix. The B and output matrices are still dense. The block size can be arbitrary.

  • Optimized Group Conv for SingleGroup mode, which requires that the number of output channels per group be a multiple of the threadblock tile N.

  • Optimized DepthWise Conv. Two new modes are added:

    • kOptimized - uses direct convolution to compute instead of implicit GEMM.
      • The restrictions are: 1) the input channel count, output channel count, and group count must each be a multiple of (128 / sizeof(input element)), e.g. 64 for fp16 inputs; 2) the input filter size must be the same as the template parameter configuration.
    • kFixedStrideDilation - puts stride and dilation into template parameters to further improve performance. In this mode, the kernel keeps some inputs persistent in registers to squeeze out more performance, so large filters, strides, or dilations are not recommended.
      • The restrictions are: 1) the input channel count, output channel count, and group count must each be a multiple of (128 / sizeof(input element)); 2) the input filter size, stride, and dilation must be the same as the template parameter configuration.
  • Scripts to fuse multiple back-to-back GEMMs. Their implementation was discussed in a GTC'22 Spring talk.

  • FP8 data type definition and conversion routines.

  • Updates and bugfixes from the community (thanks!). Big shout out to Meta's xFormers.

  • Deprecation announcement: CUTLASS plans to deprecate the following:

    • Maxwell and Pascal GPU architectures
    • Ubuntu 16.04
    • CUDA 10.2
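
For the Stream-K item above, here is a conceptual sketch of the work decomposition idea. It shows how the MAC-loop iterations of all output tiles can be spread contiguously across a fixed number of persistent CTAs; it is not CUTLASS's scheduler, and the function and its parameters are illustrative only.

```python
# Conceptual sketch of stream-K work decomposition (not CUTLASS's actual
# scheduler). Instead of splitting every output tile's K loop into the same
# fixed number of pieces (split-K), the total MAC-loop iterations of all
# output tiles are laid end to end and divided evenly among a fixed number
# of persistent CTAs. A CTA may finish one tile's tail and continue with the
# next tile's head; partially computed tiles are reduced at the end.

def stream_k_assignments(num_tiles, k_iters_per_tile, num_ctas):
    """Return, per CTA, a list of (tile, first_k_iter, num_k_iters) chunks."""
    total = num_tiles * k_iters_per_tile
    per_cta = (total + num_ctas - 1) // num_ctas        # ceil division
    work = [[] for _ in range(num_ctas)]
    for cta in range(num_ctas):
        begin = min(cta * per_cta, total)
        end = min(begin + per_cta, total)
        it = begin
        while it < end:
            tile = it // k_iters_per_tile
            k0 = it % k_iters_per_tile
            take = min(k_iters_per_tile - k0, end - it)  # stop at tile boundary
            work[cta].append((tile, k0, take))
            it += take
    return work

# Example: 5 output tiles x 8 K-iterations spread over 4 persistent CTAs.
for cta, chunks in enumerate(stream_k_assignments(5, 8, 4)):
    print(cta, chunks)
```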

CUTLASS 2.10.0

16 Sep 02:42
fc9ebc6

  • CUTLASS Python now supports GEMM, Convolution, and Grouped GEMM for different data types as well as different epilogue flavors.
  • Optimizations for CUTLASS's Grouped GEMM kernel. When applicable, it can move some scheduling to the host side.
  • Optimizations for GEMM+Softmax.
  • Grouped GEMM for Multihead Attention is a general MHA that does not require equal sequence lengths in every GEMM (see the sketch after this list).
  • GEMM + Layer norm fusion for Ampere can fuse the layernorm into the GEMMs before and after it.
  • GEMM Epilogue Permutation Fusion can permute the GEMM output before storing.
  • Grouped convolution targeting implicit GEMM introduces the first group convolution implementation to CUTLASS. It is an Analytical implementation, not an Optimized one.
  • Depthwise separable convolution introduces the first depthwise convolution, which is also Analytical for now.
  • Standalone Layernorm and Pooling kernels.
  • Back-to-back GEMM enhancements.
  • Updates and bugfixes from the community (thanks!)
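
To make the variable-sequence-length point above concrete, here is a small conceptual sketch (not the CUTLASS API) of how each sequence becomes its own problem size in a grouped GEMM, so no padding to a common maximum length is needed; the names and sizes are illustrative.

```python
# Conceptual sketch (not the CUTLASS API): with grouped GEMM, each sequence
# in a multi-head attention batch contributes its own GEMM problem size, so
# sequences of different lengths need no padding to a common maximum.
# The lengths and head dimension below are illustrative.

seq_lens = [37, 512, 129, 64]   # one entry per sequence in the batch
head_dim = 64

# One Q @ K^T problem per sequence: (M, N, K) = (seq_len, seq_len, head_dim).
qk_problems = [(s, s, head_dim) for s in seq_lens]

# A grouped GEMM kernel takes the whole list in a single launch and schedules
# the per-problem output tiles internally.
for i, (m, n, k) in enumerate(qk_problems):
    print(f"problem {i}: M={m} N={n} K={k}")
```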

CUTLASS 2.9.1

29 Jun 01:15
e45e773

Bug fixes, performance tuning, and enhancements to documentation.

CUTLASS 2.9.0

27 Apr 16:31
319a389

CUTLASS 2.8

06 Dec 19:22
5fe09c2

CUTLASS 2.7

20 Sep 18:10
2e07c4c

2.7.0

CUTLASS 2.6.1

03 Sep 17:27
6c2f8f2
  • Arbitrary padding and striding for CUTLASS Strided DGRAD Convolution operator (Analytic Iterators)
  • Tuning for GEMMs fused with partial reductions
  • Corrections and bug fixes reported by the CUTLASS community
    • Thank you for filing these issues!