Draft: Initial CUDA C++ Execution Model documentation #3873
Conversation
Thanks Gonzalo! I left some comments.
while (flag.load() == 0) {
    (void)cudaStreamQuery(0);
}
Is this a bit like driving MPI progress?
I can imagine it is very similar, at least. Does the MPI standard talk about this somewhere? cc @jeffhammond (looking at how they explain it may be useful)
MPI 4.1 Standard (Section 2.9, "Progress") says:
All MPI processes are required to guarantee progress, i.e., all decoupled MPI activities will eventually be executed. This guarantee is required to be provided during
- blocked MPI procedures, and
- repeatedly called MPI test procedures (see below) that return flag=false.

Based on that, I think it works similarly to your use of cudaStreamQuery.
Section 2.9 defines "decoupled MPI activities" as follows.
Within each MPI process parts of the communication or parallel I/O pattern are executed within the MPI procedure calls that belong to the operation in that MPI process, whereas other parts are decoupled MPI activities, i.e., they may be executed within an additional progress thread, offloaded to the network interface controller (NIC), or executed within other MPI procedure calls that are not semantically related to the given communication or parallel I/O pattern.
It defines "blocked MPI procedure" as follows.
An MPI procedure invocation is blocked if it delays its return until some specific activity or state-change has occurred in another MPI process.
Section 2.9 further distinguishes "strong progress" from "weak progress."
Strong progress is provided by an MPI implementation if all local procedures return independently of MPI procedure calls in other MPI processes (operation-related or not). An MPI implementation provides weak progress if it does not provide strong progress.
It defines "local" and "nonlocal" procedure calls as follows.
An MPI procedure call that is blocked can be
- a nonlocal MPI procedure call that delays its return until a specific semantically-related MPI call on another MPI process, or
- a local MPI procedure call that delays its return until some unspecific MPI call in another MPI process causes a specific state-change in that other MPI process, or
- an MPI finalization procedure (MPI_FINALIZE or MPI_SESSION_FINALIZE) that delays its return or exit because this MPI finalization must guarantee that all decoupled MPI activities that are related to that MPI finalization call in the calling MPI process will be executed before this MPI finalization is finished....
Stream and event ordering
-------------------------

A device-thread shall not make progress if it is dependent on termination of one or more unterminated device-threads or tasks via CUDA streams and/or events.
Regarding "shall not make progress," does that mean "it definitely will not" or "it might not but it might"? The example's comments suggest the latter -- that is, whether or not it makes progress depends on scheduling order, which is unspecified.
It means we guarantee it does not make progress. In the example "Execution.Model.Stream.1" below with two kernels on the same stream, this sentence guarantees that no thread of the second kernel makes progress until all threads from the first kernel terminate.
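As a hedged illustration of that guarantee (the kernel names here are invented; this is not the literal "Execution.Model.Stream.1" example from the document), two kernels launched on the same stream behave as follows:

```cuda
// Illustrative sketch, not the literal example from the document.
#include <cstdio>

__global__ void first()  { /* ... work ... */ }
__global__ void second() { /* ... work ... */ }

int main() {
    // Both kernels go to the same (default) stream, so per the quoted rule
    // no thread of `second` makes progress until every thread of `first`
    // has terminated.
    first<<<1, 1>>>();
    second<<<1, 1>>>();
    cudaDeviceSynchronize();
    printf("both kernels done\n");
    return 0;
}
```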
That answers my question -- thanks Gonzalo! :-)
[Note: The device thread need not be "related" to the API call, e.g., an API operating on one stream or process may ensure progress of a device thread on another stream or process. - end note.]

[Note: A simple but not sufficient method to test workloads for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1`` - end note.]
if it is insufficient, why mention it here?
"How do I test that my program conforms to forward progress?" is a question users frequently ask.
Testing as suggested is our recommended way to do that. While insufficient, it will catch many/most issues, and is the only "tool" we provide for this. It is therefore worth documenting somewhere, and for now, this document is the only place in our entire documentation in which we talk about this topic, so "here" seemed better than "nowhere".
If we eventually develop high-level user-documentation for any of this, we should probably expand on this there.
Does the following make it clearer?
[Note: A simple but not sufficient method to test workloads for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1`` - end note.]
[Note: A simple but not sufficient method to test a program for CUDA API Forward Progress conformance is to run them with following environment variables set: ``CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1``, and then check that the program still terminates.
If it does not, the program has a bug.
This method is not sufficient because it does not catch all Forward Progress bugs, but it does catch many such bugs. - end note.]
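In practice, the suggested test might look like the following shell invocation. `./my_app` is a hypothetical placeholder for the workload under test; the two environment variables are the ones named in the note.

```shell
# ./my_app is a hypothetical workload; the variables are from the note above.
# Serialize work submission and make kernel launches blocking:
CUDA_DEVICE_MAX_CONNECTIONS=1 CUDA_LAUNCH_BLOCKING=1 ./my_app
# If the program hangs under these settings, it has a forward-progress bug.
# If it terminates, that alone does not prove conformance.
```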
cuda::atomic<int, cuda::thread_scope_system> flag = 0;
__global__ void producer() { flag.store(1); }
int main() {
some blank lines in these examples would make them more readable
Stream and event ordering
-------------------------

A device-thread shall not make progress if it is dependent on termination of one or more unterminated device-threads or tasks via CUDA streams and/or events.
what is a "task" here?
CUDA doesn't currently have a definition for "operations" on a stream. We call them tasks here and in some other parts of our documentation, but we don't define that anywhere. Some other parts of the documentation call them "Commands" as clarified in the note below.
We should eventually properly define that somewhere, e.g., in the CUDA Driver/Runtime documentation, and then just update this here to use the right term and reference that.
See #3873 (comment) for a suggestion, let me know if that resolves this.
Co-authored-by: Mark Hoemmen <[email protected]>
A device-thread shall not make progress if it is dependent on termination of one or more unterminated device-threads or tasks via CUDA streams and/or events.

[Note: This excludes dependencies such as Programmatic Dependent Launch or Launch Completion which do not encompass termination of the dependency. - end note.]

[Note: Tasks are also referred to as `Commands <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams>`__. - end note.]
@ericniebler Does this resolve https://github.com/NVIDIA/cccl/pull/3873/files#r1964536079 and make it clearer?
A device-thread shall not make progress if it is dependent on termination of one or more unterminated device-threads or tasks via CUDA streams and/or events.
[Note: This excludes dependencies such as Programmatic Dependent Launch or Launch Completion which do not encompass termination of the dependency. - end note.]
[Note: Tasks are also referred to as `Commands <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams>`__. - end note.]
A device thread shall not start making progress until all its dependencies have completed.
[Note: Dependencies that prevent device threads from starting to make progress can be created, for example, via CUDA Stream `Command <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams>`__s.
These may include dependencies on the completion of, among others, `CUDA Events <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#events>`__ and `CUDA Kernels <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels>`__. - end note.]
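One way such dependencies arise in code is via a CUDA event recorded on one stream and waited on by another. A hedged sketch (stream, event, and kernel names are invented here, not taken from the document):

```cuda
// Illustrative sketch of an event-based cross-stream dependency.
__global__ void producer() { /* ... work ... */ }
__global__ void consumer() { /* ... work ... */ }

int main() {
    cudaStream_t s1, s2;
    cudaEvent_t done;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreate(&done);

    producer<<<1, 1, 0, s1>>>();
    cudaEventRecord(done, s1);        // completes once producer's work on s1 completes
    cudaStreamWaitEvent(s2, done, 0); // s2 now depends on that completion
    consumer<<<1, 1, 0, s2>>>();      // no consumer thread starts before producer terminates

    cudaDeviceSynchronize();
    cudaEventDestroy(done);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```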
Co-authored-by: Mark Hoemmen <[email protected]>
pre-commit.ci autofix
Description
Initial documentation for the CUDA C++ Execution Model. We can expand this over time, but we need to start somewhere.
Checklist