
Feature request: Consider privatization instead of forwarding in fusion segmentation #3832

Open
@naoyam

Description

Input forwarding helps segmentation avoid unnecessary intermediate tensors, but it also hides those forwarded ops from the segmenter and schedulers, which can result in performance issues.

For example, if a cast op from bf16 to fp32 is forwarded, the normalization schedulers must assume a 2x larger persistent buffer, because they can no longer project the buffer back to the lower-precision input.

Here's a repro with the inner normalization scheduler. Run it as `SEG=1 S=$((64 * 1024)) NVFUSER_DUMP=segmented_fusion ./bin/nvfuser_tests --gtest_filter='*FowardingMiss*'`. The `S` parameter needs to be adjusted for the actual GPU: choose a size that fits in shared memory when the data type is bfloat16 but not when it is float, which is 64K on an RTX 6000.

```cpp
TEST_F(SegmentationTest, FowardingMissProjectionToLowerPrecisionInput) {
  std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
  Fusion& fusion = *fusion_ptr;
  FusionGuard fg(&fusion);

  auto tv0 = makeSymbolicTensor(2, DataType::BFloat16);
  fusion.addInput(tv0);

  auto tv1 = castOp(DataType::Float, tv0);
  auto tv2 = set(tv1);
  auto tv3 = sum(tv2, {1});
  auto tv4 = broadcast(tv3, {false, true});
  auto tv5 = add(tv2, tv4);
  auto tv6 = castOp(DataType::BFloat16, tv5);
  fusion.addOutput(tv6);

  // Forces segmentation
  if (getenv("SEG")) {
    auto tv7 = makeSymbolicTensor(1, DataType::BFloat16);
    fusion.addInput(tv7);
    fusion.addOutput(segment_set(tv7));
  }

  int64_t size = atoi(getenv("S"));
  auto options = at::TensorOptions().dtype(at::kBFloat16).device(at::kCUDA, 0);
  at::Tensor t0 = at::randn({128, size}, options);
  std::vector<c10::IValue> inputs = {t0};
  if (getenv("SEG")) {
    at::Tensor t1 = at::randn({10}, options);
    inputs.emplace_back(t1);
  }

  FusionExecutorCache executor_cache(std::move(fusion_ptr));
  auto outputs = executor_cache.runFusionWithInputs(inputs);
  testValidate(&fusion, outputs, inputs, __LINE__, __FILE__);
}
```

```
Segmented_Fusion{
groups:
  no_op{6}
  reduction{0, 1, 2, 3}
  transpose{4, 5}
edges:
  e{ reduction{0, 1, 2, 3} -> transpose{4, 5}(T4_g_float[iS8{i0}, bS9{1}]) }
  e{ reduction{0, 1, 2, 3} -> transpose{4, 5}(T2_g_float[iS4{i0}, iS5{i2}]) }
}
```

As shown above, the normalization scheduler is not used because the persistent buffer in DataType::Float is too large. We would expect the scheduler to use the bfloat16 input as the persistent buffer, but that doesn't happen here since the cast op is forwarded and thus hidden from the scheduler.

This isn't an issue when segmentation is avoided: if SEG=1 is omitted, the fusion is indeed scheduled as an inner persistent kernel without segmentation.

This seems like a fundamental issue with the forwarding approach since it hides actual ops from the schedulers.

The privatization approach introduced in #3776 should not have this problem and should be able to provide the same benefits if extended.
