Draft: Initial CUDA C++ Execution Model documentation #3873
Conversation
Thanks Gonzalo! I left some comments.
while (flag.load() == 0) {
    (void)cudaStreamQuery(0);
}
Is this a bit like driving MPI progress?
I can imagine it is very similar, at least. Does the MPI standard talk about this somewhere? cc @jeffhammond (looking at how they explain it may be useful)
MPI 4.1 Standard (Section 2.9, "Progress") says:
All MPI processes are required to guarantee progress, i.e., all decoupled MPI activities will eventually be executed. This guarantee is required to be provided during
- blocked MPI procedures, and
- repeatedly called MPI test procedures (see below) that return flag=false.

Based on that, I think it works similarly to your use of cudaStreamQuery.
Section 2.9 defines "decoupled MPI activities" as follows.
Within each MPI process parts of the communication or parallel I/O pattern are executed within the MPI procedure calls that belong to the operation in that MPI process, whereas other parts are decoupled MPI activities, i.e., they may be executed within an additional progress thread, offloaded to the network interface controller (NIC), or executed within other MPI procedure calls that are not semantically related to the given communication or parallel I/O pattern.
It defines "blocked MPI procedure" as follows.
An MPI procedure invocation is blocked if it delays its return until some specific activity or state-change has occurred in another MPI process.
Section 2.9 further distinguishes "strong progress" from "weak progress."
Strong progress is provided by an MPI implementation if all local procedures return independently of MPI procedure calls in other MPI processes (operation-related or not). An MPI implementation provides weak progress if it does not provide strong progress.
It defines "local" and "nonlocal" procedure calls as follows.
An MPI procedure call that is blocked can be
- a nonlocal MPI procedure call that delays its return until a specific semantically-related MPI call on another MPI process, or
- a local MPI procedure call that delays its return until some unspecific MPI call in another MPI process causes a specific state-change in that other MPI process, or
- an MPI finalization procedure (MPI_FINALIZE or MPI_SESSION_FINALIZE) that delays its return or exit because this MPI finalization must guarantee that all decoupled MPI activities that are related to that MPI finalization call in the calling MPI process will be executed before this MPI finalization is finished....
Stream and event ordering
-------------------------

A device-thread shall not make progress if it is dependent on termination of one or more unterminated device-threads or tasks via CUDA streams and/or events.
Regarding "shall not make progress," does that mean "it definitely will not" or "it might not but it might"? The example's comments suggest the latter -- that is, whether or not it makes progress depends on scheduling order, which is unspecified.
It means we guarantee it does not make progress. In the example "Execution.Model.Stream.1" below with two kernels on the same stream, this sentence guarantees that no thread of the second kernel makes progress until all threads from the first kernel terminate.
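As a hedged illustration of that guarantee (the kernel names here are invented; this is not the literal "Execution.Model.Stream.1" example from the document), two kernels launched on the same stream behave as follows:

```cuda
// Illustrative sketch, not the literal example from the document.
#include <cstdio>

__global__ void first()  { /* ... work ... */ }
__global__ void second() { /* ... work ... */ }

int main() {
    // Both kernels go to the same (default) stream, so per the quoted rule
    // no thread of `second` makes progress until every thread of `first`
    // has terminated.
    first<<<1, 1>>>();
    second<<<1, 1>>>();
    cudaDeviceSynchronize();
    printf("both kernels done\n");
    return 0;
}
```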
That answers my question -- thanks Gonzalo! :-)
[Note: The device thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device thread on another stream or process. - end note.]

[Note: A simple but not sufficient method to test workloads for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1`` - end note.]
if it is insufficient, why mention it here?
"How do I test that my program conforms to forward progress?" is a question users frequently ask.
Testing as suggested is our recommended way to do that. While insufficient, it will catch many/most issues, and is the only "tool" we provide for this. It is therefore worth documenting somewhere, and for now, this document is the only place in our entire documentation in which we talk about this topic, so "here" seemed better than "nowhere".
If we eventually develop high-level user-documentation for any of this, we should probably expand on this there.
Does the following make it clearer?
[Note: A simple but not sufficient method to test workloads for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1`` - end note.]
[Note: A simple but not sufficient method to test a program for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1``, and then check that the program still terminates.
If it does not, the program has a bug.
This method is not sufficient because it does not catch all Forward Progress bugs, but it does catch many such bugs. - end note.]
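In practice, the suggested test might look like the following shell invocation. `./my_app` is a hypothetical placeholder for the workload under test; the two environment variables are the ones named in the note.

```shell
# ./my_app is a hypothetical workload; the variables are from the note above.
# Serialize work submission and make kernel launches blocking:
CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1 ./my_app
# If the program hangs under these settings, it has a forward-progress bug.
# If it terminates, that alone does not prove conformance.
```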
cuda::atomic<int, cuda::thread_scope_system> flag = 0;
__global__ void producer() { flag.store(1); }
int main() {
some blank lines in these examples would make them more readable
Stream and event ordering
-------------------------

A device-thread shall not make progress if it is dependent on termination of one or more unterminated device-threads or tasks via CUDA streams and/or events.
what is a "task" here?
CUDA doesn't currently have a definition for "operations" on a stream. We call them tasks here and in some other parts of our documentation, but we don't define that anywhere. Some other parts of the documentation call them "Commands" as clarified in the note below.
We should eventually properly define that somewhere, e.g., in the CUDA Driver/Runtime documentation, and then just update this here to use the right term and reference that.
See #3873 (comment) for a suggestion, let me know if that resolves this.
Co-authored-by: Mark Hoemmen <[email protected]>
A device-thread shall not make progress if it is dependent on termination of one or more unterminated device-threads or tasks via CUDA streams and/or events.

[Note: This excludes dependencies such as Programmatic Dependent Launch or Launch Completion which do not encompass termination of the dependency. - end note.]

[Note: Tasks are also referred to as `Commands <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams>`__. - end note.]
@ericniebler Does this resolve https://github.com/NVIDIA/cccl/pull/3873/files#r1964536079 and make it clearer?
A device-thread shall not make progress if it is dependent on termination of one or more unterminated device-threads or tasks via CUDA streams and/or events.
[Note: This excludes dependencies such as Programmatic Dependent Launch or Launch Completion which do not encompass termination of the dependency. - end note.]
[Note: Tasks are also referred to as `Commands <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams>`__. - end note.]
A device thread shall not start making progress until all its dependencies have completed.
[Note: Dependencies that prevent device threads from starting to make progress can be created, for example, via CUDA Stream `Command <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams>`__s.
These may include dependencies on the completion of, among others, `CUDA Events <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#events>`__ and `CUDA Kernels <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels>`__. - end note.]
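One way such dependencies arise in code is via a CUDA event recorded on one stream and waited on by another. A hedged sketch (stream, event, and kernel names are invented here, not taken from the document):

```cuda
// Illustrative sketch of an event-based cross-stream dependency.
__global__ void producer() { /* ... work ... */ }
__global__ void consumer() { /* ... work ... */ }

int main() {
    cudaStream_t s1, s2;
    cudaEvent_t done;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreate(&done);

    producer<<<1, 1, 0, s1>>>();
    cudaEventRecord(done, s1);        // completes once producer's work on s1 completes
    cudaStreamWaitEvent(s2, done, 0); // s2 now depends on that completion
    consumer<<<1, 1, 0, s2>>>();      // no consumer thread starts before producer terminates

    cudaDeviceSynchronize();
    cudaEventDestroy(done);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```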
Co-authored-by: Mark Hoemmen <[email protected]>
pre-commit.ci autofix
Description
Initial documentation for the CUDA C++ Execution Model. We can expand this over time, but we need to start somewhere.
Checklist