
Feature request: Consider privatization instead of forwarding in fusion segmentation #3832

Open
@naoyam

Description

Input forwarding helps segmentation avoid unnecessary intermediate tensors, but it also hides those forwarded ops from the segmenter and schedulers, which can result in performance issues.

For example, if a cast op from bf16 to fp32 is forwarded, the normalization schedulers must assume a 2x larger persistent buffer, because they can no longer project the buffer back to the lower-precision input.

Here's a repro with the inner normalization scheduler. Run it as `SEG=1 S=$((64 * 1024)) NVFUSER_DUMP=segmented_fusion ./bin/nvfuser_tests --gtest_filter='*FowardingMiss*'`. The `S` parameter needs to be adjusted for the actual GPU: choose a size that fits in shared memory when the data type is bfloat16 but not when it is float, which is 64K on an RTX 6000.

```cpp
TEST_F(SegmentationTest, FowardingMissProjectionToLowerPrecisionInput) {
  std::unique_ptr<Fusion> fusion_ptr = std::make_unique<Fusion>();
  Fusion& fusion = *fusion_ptr;
  FusionGuard fg(&fusion);

  auto tv0 = makeSymbolicTensor(2, DataType::BFloat16);
  fusion.addInput(tv0);

  auto tv1 = castOp(DataType::Float, tv0);
  auto tv2 = set(tv1);
  auto tv3 = sum(tv2, {1});
  auto tv4 = broadcast(tv3, {false, true});
  auto tv5 = add(tv2, tv4);
  auto tv6 = castOp(DataType::BFloat16, tv5);
  fusion.addOutput(tv6);

  // Forces segmentation
  if (getenv("SEG")) {
    auto tv7 = makeSymbolicTensor(1, DataType::BFloat16);
    fusion.addInput(tv7);
    fusion.addOutput(segment_set(tv7));
  }

  int64_t size = atoi(getenv("S"));
  auto options = at::TensorOptions().dtype(at::kBFloat16).device(at::kCUDA, 0);
  at::Tensor t0 = at::randn({128, size}, options);
  std::vector<c10::IValue> inputs = {t0};
  if (getenv("SEG")) {
    at::Tensor t1 = at::randn({10}, options);
    inputs.emplace_back(t1);
  }

  FusionExecutorCache executor_cache(std::move(fusion_ptr));
  auto outputs = executor_cache.runFusionWithInputs(inputs);
  testValidate(&fusion, outputs, inputs, __LINE__, __FILE__);
}
```

```
Segmented_Fusion{
groups:
  no_op{6}
  reduction{0, 1, 2, 3}
  transpose{4, 5}
edges:
  e{ reduction{0, 1, 2, 3} -> transpose{4, 5}(T4_g_float[iS8{i0}, bS9{1}]) }
  e{ reduction{0, 1, 2, 3} -> transpose{4, 5}(T2_g_float[iS4{i0}, iS5{i2}]) }
}
```

As shown above, the normalization scheduler is not used because the persistent buffer in DataType::Float is too large. We would expect the scheduler to use the bfloat16 input as the persistent buffer, but that doesn't happen here since the cast op is forwarded and thus hidden from the scheduler.

This isn't an issue when segmentation is avoided: if SEG=1 is omitted, the fusion is indeed scheduled as an inner persistent kernel without segmentation.

This seems like a fundamental issue with the forwarding approach since it hides actual ops from the schedulers.

The privatization approach introduced in #3776 should not have this problem and should be able to provide the same benefits if extended.
