Refactor stacked version of FP8 Grouped Gemm for reduced overhead #3699
base: main
Conversation
This pull request was exported from Phabricator. Differential Revision: D69544396
Summary:
Pull Request resolved: pytorch#3699
X-link: https://github.com/facebookresearch/FBGEMM/pull/780

Currently, the stacked version of FP8 grouped GEMM accepts lists of tensor inputs and produces a single tensor output. This reduces overhead considerably when CUDA graphs are used, but it still requires splitting input tensors during prefill, which can be costly. This diff updates the input types of stacked grouped GEMM to support single tensors. Notably, since M varies across groups and we do no padding, this change requires a new input tensor called `M_offsets` that indicates the row at which each group begins within the first input. We create `M_offsets` by taking the cumulative sum of M for each group, which we may be able to optimize further.

This diff also includes a long overdue refactor of grouped GEMM setup for NVIDIA such that we launch only a single kernel rather than one per group. This should reduce overhead considerably in some cases.

Differential Revision: D69544396
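Below is a minimal sketch, in PyTorch, of how the single-tensor input layout and `M_offsets` could be assembled on the caller side. The entry-point name `stacked_fp8_grouped_gemm` and the scale arguments are hypothetical placeholders, not the actual FBGEMM operator; only the construction of `M_offsets` as a cumulative sum of per-group M values follows directly from the summary above.

```python
import torch

# Per-group row counts; M varies across groups, K and N are shared.
Ms = [128, 64, 256]
K, N = 512, 1024

# All groups' activations stacked along the row dimension, with no padding.
XQ = torch.randn(sum(Ms), K, device="cuda").to(torch.float8_e4m3fn)
# One weight matrix per group.
WQ = torch.randn(len(Ms), N, K, device="cuda").to(torch.float8_e4m3fn)

# M_offsets is the cumulative sum of the per-group M values, marking the
# row boundary of each group within XQ.
M_offsets = torch.cumsum(
    torch.tensor(Ms, device="cuda", dtype=torch.int64), dim=0
)

# Hypothetical call shape (placeholder name, not the real FBGEMM API):
# out = stacked_fp8_grouped_gemm(XQ, WQ, x_scale, w_scale, M_offsets)
# out would have shape [sum(Ms), N]; the rows of group i would span
# [M_offsets[i-1], M_offsets[i]), with an implied leading offset of 0.
```

Because `M_offsets` lives on the device, the host-side per-group splitting that the summary calls out as costly during prefill can presumably be skipped, and the single setup kernel on the NVIDIA path can derive group boundaries from it directly.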