Refactor stacked version of FP8 Grouped Gemm for reduced overhead #3699

jwfromm · 2025-02-17T20:36:23Z

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/780

Currently, the stacked version of FP8 grouped gemm accepts lists of tensor inputs and produces a single tensor output. This reduces quite a bit of overhead when CUDA graphs are used, but still requires splitting input tensors in prefill, which can be costly. This diff updates the input types of stacked grouped gemm to support single tensors. Notably, since M varies across groups and we do no padding, this change requires a new input tensor called `M_offsets` that indicates the row at which each group begins within the first input. We create `M_offsets` by taking the cumulative sum of M for each group, which we may be able to further optimize.
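As a rough illustration of the offset computation described above, here is a minimal pure-Python sketch. The helper name `m_offsets_from_sizes` and the example sizes are hypothetical and not taken from the diff. The summary does not say whether `M_offsets` stores the inclusive cumulative sum (the row where each group ends) or the shifted version (the row where each group starts), so the sketch returns both.

```python
from itertools import accumulate


def m_offsets_from_sizes(m_sizes):
    """Return (starts, ends) row offsets for groups stacked along M.

    Hypothetical helper for illustration only; not the FBGEMM API.
    """
    # Inclusive cumulative sum: the row index one past the end of each group.
    ends = list(accumulate(m_sizes))
    # Subtracting each group's size gives the row where that group begins.
    starts = [end - m for end, m in zip(ends, m_sizes)]
    return starts, ends


# Example: four groups with varying M, including an empty group.
m_sizes = [3, 0, 5, 2]
starts, ends = m_offsets_from_sizes(m_sizes)
print(starts)  # [0, 3, 3, 8]
print(ends)    # [3, 3, 8, 10]
```

Because no padding is applied, an empty group simply has the same start and end row.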

Differential Revision: D69544396

facebook-github-bot · 2025-02-17T20:36:57Z

This pull request was exported from Phabricator. Differential Revision: D69544396

netlify · 2025-02-17T20:37:21Z

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`d8a3a6f`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67b3c22c2dc6af000828002e
😎 Deploy Preview	https://deploy-preview-3699--pytorch-fbgemm-docs.netlify.app

To edit notification comments on pull requests, go to your Netlify site configuration.

facebook-github-bot · 2025-02-17T22:52:13Z

This pull request was exported from Phabricator. Differential Revision: D69544396

…torch#3699)

Summary:
Pull Request resolved: pytorch#3699
X-link: facebookresearch/FBGEMM#780

Currently, the stacked version of FP8 grouped gemm accepts lists of tensor inputs and produces a single tensor output. This reduces quite a bit of overhead when CUDA graphs are used, but still requires splitting input tensors in prefill, which can be costly. This diff updates the input types of stacked grouped gemm to support single tensors. Notably, since M varies across groups and we do no padding, this change requires a new input tensor called `M_offsets` that indicates the row at which each group begins within the first input. We create `M_offsets` by taking the cumulative sum of M for each group, which we may be able to further optimize.

This diff also includes a long overdue refactor of grouped gemm setup for NVIDIA such that we only launch a single kernel rather than one per group. This should reduce overhead by quite a bit in some cases.

Differential Revision: D69544396

facebook-github-bot · 2025-02-17T22:59:03Z

This pull request was exported from Phabricator. Differential Revision: D69544396


facebook-github-bot · 2025-02-17T23:11:38Z

This pull request was exported from Phabricator. Differential Revision: D69544396

facebook-github-bot added the cla signed label Feb 17, 2025

facebook-github-bot added the fb-exported label Feb 17, 2025

jwfromm force-pushed the export-D69544396 branch from e8865f2 to 5e0b36c Compare February 17, 2025 22:52

jwfromm force-pushed the export-D69544396 branch from 5e0b36c to f5c437a Compare February 17, 2025 22:59

jwfromm force-pushed the export-D69544396 branch from f5c437a to d8a3a6f Compare February 17, 2025 23:11
