Error with driver API's lazy load of cuStream ops #3907

Open
@samnordmann

Description

Related: Teams thread

I am trying to use cuStreamWriteValue32, which is part of the CUDA driver API (context: #3894). The code builds, but at runtime the call fails with CUDA_ERROR_NOT_SUPPORTED. The operation should be supported: I am running on a DGX H100 node with CUDA 12.8, inside the latest pjnl Docker container.
Repro:

The source of the problem can be narrowed down to the lazy loading of /usr/local/cuda/compat/lib.real/libcuda.so.1 in the pjnl container: the bug comes either from the lazy-loading mechanism or from the library itself.

To support this, note that the following patch, which links against libcuda explicitly and calls cuStreamWriteValue32 through cuda.h instead of the lazy-loading wrapper, makes the error go away:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 9d7d7b32..3e51bce8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -334,6 +334,7 @@ target_link_libraries(codegen_internal PUBLIC
   ${LIBCUPTI}
   ${TORCH_LIBRARIES}
   dl
+  cuda
)

add_library(nvfuser_codegen SHARED $<TARGET_OBJECTS:codegen_internal>)
diff --git a/csrc/driver_api.h b/csrc/driver_api.h
index 41072a22..b8c413a4 100644
--- a/csrc/driver_api.h
+++ b/csrc/driver_api.h
@@ -37,7 +37,6 @@ namespace nvfuser {
#if (CUDA_VERSION >= 12000)
#define ALL_DRIVER_API_WRAPPER(fn)   \
   ALL_DRIVER_API_WRAPPER_CUDA11(fn); \
-  fn(cuStreamWriteValue32);          \
   fn(cuTensorMapEncodeTiled)
#else
#define ALL_DRIVER_API_WRAPPER ALL_DRIVER_API_WRAPPER_CUDA11
diff --git a/tests/cpp/test_gpu3.cpp b/tests/cpp/test_gpu3.cpp
index 9570bb9b..a7236c6a 100644
--- a/tests/cpp/test_gpu3.cpp
+++ b/tests/cpp/test_gpu3.cpp
@@ -56,7 +56,8 @@
#include <sstream>
#include "parallel_dimension_map.h"

-#include <driver_api.h>
+// #include <driver_api.h>
+#include <cuda.h>
namespace nvfuser {

using namespace at::indexing;

Also, cuda-gdb gives the following backtrace at the point where the error is reported:

#0  0x00007fff37f740f0 in cudbgReportDriverApiError () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#1  0x00007fff381e312b in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#2  0x00007fff2f4c0d47 in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#3  0x00007fff2f49c29e in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#4  0x00007fff2f4af56d in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#5  0x00007fff2f5aebd6 in ?? () from /usr/local/cuda/compat/lib.real/libcudadebugger.so.1
#6  0x00007fff380c05d0 in ?? () from /usr/local/cuda/compat/lib.real/libcuda.so.1
#7  0x0000555555a67b3e in lazilyLoadAndInvoke (args#0=0x7fff2ad0d618, args#1=140724802682880, args#2=3, args#3=0) at /opt/pytorch/Fuser2/csrc/driver_api.cpp:95
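
To help separate "wrong library" from "wrong entry point", a small probe can report which shared object a lazily resolved symbol actually comes from. The sketch below is not nvFuser code: resolveLazily and ResolvedSymbol are hypothetical names, and it assumes a glibc Linux system where dlopen, dlsym and dladdr are available. It only shows the general dlopen/dlsym pattern, not how driver_api.cpp implements lazilyLoadAndInvoke.

```cpp
// Sketch: resolve a symbol lazily with dlopen/dlsym and report which shared
// object defines it. Build with something like:
//   g++ -std=c++17 probe.cpp -ldl
#include <dlfcn.h>

#include <optional>
#include <string>

// Hypothetical helper type (not part of nvFuser): one resolved symbol.
struct ResolvedSymbol {
  void* address;             // entry point returned by dlsym
  std::string library_path;  // file that dladdr reports as defining it
};

// Loads `library` on first use and looks up `symbol` in it.
// Returns std::nullopt if the library or the symbol cannot be found.
std::optional<ResolvedSymbol> resolveLazily(
    const char* library,
    const char* symbol) {
  // The handle is intentionally never closed, so the returned address
  // stays valid for the lifetime of the process.
  void* handle = dlopen(library, RTLD_LAZY | RTLD_LOCAL);
  if (handle == nullptr) {
    return std::nullopt;
  }

  void* address = dlsym(handle, symbol);
  if (address == nullptr) {
    return std::nullopt;
  }

  // dladdr maps an address back to the shared object that contains it.
  std::string path = "<unknown>";
  Dl_info info{};
  if (dladdr(address, &info) != 0 && info.dli_fname != nullptr) {
    path = info.dli_fname;
  }
  return ResolvedSymbol{address, path};
}

// Example use (inside the container):
//   auto r = resolveLazily("libcuda.so.1", "cuStreamWriteValue32");
//   if (r) { std::puts(r->library_path.c_str()); }
```

If the reported path is the compat library from the backtrace, the probe confirms that the lazy path and the backtrace agree on which libcuda.so.1 is in use. It may also be worth probing a versioned name such as cuStreamWriteValue32_v2, on the assumption (not confirmed in this issue) that cuda.h maps the unsuffixed name to a versioned entry point while a lookup by plain string does not.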
