This is a header-only library of integrated wrappers around the core parts of NVIDIA’s CUDA execution ecosystem:
It is intended for those who would otherwise use these APIs directly, to make working with them more intuitive and consistent, making use of modern C++ language capabilities, programming idioms and best practices. In a nutshell - making CUDA API work more fun :-)
Also, and importantly - while the wrappers seem to be “high-level”, more “abstract” code - they are nothing more than a modern-C++-aesthetic arrangement of NVIDIA’s own APIs. The wrapper library does not force any abstractions above CUDA’s own, nor conventions regarding where to place data, how and when to perform synchronization, etc.; you have the complete range of expression of the underlying APIs.
In contrast to the above, this library provides:
There is one noteworthy caveat: The wrapper API calls cannot make assumptions about previous or later code of yours, which means some of them require more calls to obtain the current context handle or push a(n existing) context, then pop it. While these calls are cheap, they are still non-trivial and can’t be optimized away.
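The push-then-pop pattern mentioned in this caveat can be pictured with a small RAII guard. The following is only a sketch over a mock context stack: `context_handle`, `context_stack()`, `scoped_context_override` and `allocate_in()` are hypothetical names invented for illustration, no CUDA call is made, and this is not the library's actual implementation.

```cpp
#include <cassert>
#include <vector>

// Mock stand-in for the driver's per-thread context stack (hypothetical).
using context_handle = int;

inline std::vector<context_handle>& context_stack()
{
    static thread_local std::vector<context_handle> stack;
    return stack;
}

// Pushes a context on construction and pops it on destruction, so the
// caller's current context is restored however the scope is exited.
class scoped_context_override {
public:
    explicit scoped_context_override(context_handle ctx) { context_stack().push_back(ctx); }
    ~scoped_context_override() { context_stack().pop_back(); }
    scoped_context_override(const scoped_context_override&) = delete;
    scoped_context_override& operator=(const scoped_context_override&) = delete;
};

// A wrapper-style call: it cannot assume which context is current, so it
// makes its own context current for the duration of the call only.
inline context_handle allocate_in(context_handle ctx)
{
    scoped_context_override guard(ctx);
    assert(!context_stack().empty());
    return context_stack().back(); // the context the "allocation" would run in
}
```

The extra push and pop are the cost the caveat refers to: the guard cannot be skipped, because the wrapper has no knowledge of what the surrounding code has made current.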
NVIDIA provides two main APIs for using CUDA: The Runtime API and the Driver API. These suffer from several deficiencies:
You may have noticed this list reads like the opposite of the key features, listed above: The idea is to make this library overcome and rectify all of these deficiencies as much as possible.
CUDA: v11.x or later recommended, v9.0 or later supported.
Remember that an NVIDIA driver compatible with your CUDA version also needs to be installed. Typically, this can be the one bundled in your CUDA distribution itself.
For CMake, you have three alternatives for obtaining the library to use in your project:
zlib, use cuda-api-wrappers with the version you’re interested in. Briefly, you will need to create an appropriate conanfile.txt file; make sure you have a Conan profile; and use conan install to target your project’s build directory. This will create a CMake toolchain file with which you actually build your project.
vcpkg install cuda-api-wrappers
FetchContent module to have CMake itself obtain the project source code and make it part of your own project’s build, e.g.:
include(FetchContent)
FetchContent_Declare(cuda-api-wrappers_library
    GIT_REPOSITORY https://github.com/eyalroz/cuda-api-wrappers.git
    GIT_TAG v12.34.56 # Replace this with a real available version
    OVERRIDE_FIND_PACKAGE
)
Now that you have the package, in your project’s CMakeLists.txt, you write:
find_package(cuda-api-wrappers CONFIG REQUIRED)
This will let you use three targets within the cuda-api-wrappers:: namespace: runtime-and-driver, nvrtc and nvtx. For example:
target_link_libraries(my_app cuda-api-wrappers::runtime-and-driver)
Use not involving CMake:
Since this is a header-only library, you can simply add the src/ subdirectory as one of your project’s include directories. However, if you do this, it will be up to you to make sure the CUDA include directory is in your include path as well, and to link against the CUDA driver, runtime API, NVRTC and/or NVTX libraries as appropriate.
Most, but not quite all, API calls in the Runtime, Driver, NVTX and NVRTC APIs are covered by these wrappers. You can find the main omissions as issues tagged with “missing-cuda-feature”, intended for further development work. Additionally, the wrapper library does not cover the APIs for interoperability with OpenGL, Direct3D, EGL and VDPAU.
The Milestones indicate some features which aren’t covered and are slated for future work. Since I am not currently working on anything graphics-related, there are no short-term plans to extend coverage to more graphics-related APIs; however - PRs are welcome.
Don’t you wish you were able to type in, say:
auto callback = [&foo] { std::cout << "Hello " << foo << " world!\n"; };
my_stream.enqueue.host_invokable(callback);
… and have that just work? And don’t you wish CUDA API invocations could be plain and straightforward one-liners? Like this sequence?
auto compilation_output = my_program.compile();
auto module = cuda::module::create(context, compilation_output);
auto my_kernel = module.get_kernel(mangled_kernel_name);
auto gpu_side_buffer = cuda::memory::make_unique_span<float>(device, n);
cuda::memory::copy(gpu_side_buffer, host_side_buffer);
auto launch_config = cuda::launch_config_builder().overall_size(n).block_size(256).build();
cuda::launch(my_kernel, launch_config, gpu_side_buffer.data());
cuda::memory::copy(host_side_buffer, gpu_side_buffer);
Well, now all of the above is valid code! :-)
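To make the launch configuration step in that sequence a little more concrete, here is a sketch of the arithmetic a fluent builder of this kind might perform. It assumes the grid size is derived from the overall size by rounding up to whole blocks; the `launch_config` struct and the `launch_config_builder` class below are simplified, hypothetical stand-ins and not the library's own types.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, much-simplified result type (one-dimensional only).
struct launch_config {
    std::size_t grid_size;   // number of blocks
    std::size_t block_size;  // threads per block
};

// Hypothetical builder illustrating the fluent style; each setter returns
// the builder itself so that calls can be chained.
class launch_config_builder {
public:
    launch_config_builder& overall_size(std::size_t n) { overall_size_ = n; return *this; }
    launch_config_builder& block_size(std::size_t n)   { block_size_ = n;   return *this; }

    launch_config build() const
    {
        assert(block_size_ > 0 && "block size must be set before building");
        // Round up, so that a partially-filled final block covers the tail.
        std::size_t grid = (overall_size_ + block_size_ - 1) / block_size_;
        return launch_config{ grid, block_size_ };
    }

private:
    std::size_t overall_size_ = 0;
    std::size_t block_size_   = 0;
};
```

With this sketch, `launch_config_builder().overall_size(1000).block_size(256).build()` yields 4 blocks of 256 threads; a kernel launched this way would still need to guard against the 24 surplus threads in the last block.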
To be a little more thorough than just an anecdote, let’s relate back to some of the design principles listed above:
With this library, you would do cuda::memory::host::allocate(num_bytes) instead of calling cudaMallocHost() or cuMemAllocHost(); and if you want to allocate global device memory, it would be my_device.memory.allocate(num_bytes) (or cuda::device::get(device_id).memory.allocate(num_bytes)). Or it could be a stand-alone function as well, i.e. cuda::memory::device::allocate(my_device, num_bytes). The compiled version of these calls will be nothing but the sequence of API calls: cuInit(), cuDevicePrimaryCtxRetain(), cuCtxPushCurrent(), cuMemAlloc() etc.
The expression
my_device.compute_capability() >= cuda::make_compute_capability(60)
is a valid comparison, true for all devices with a Pascal-or-later micro-architecture. This, despite the fact that struct cuda::compute_capability_t is a POD type with two unsigned integer fields, not a scalar.
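One way such a comparison can be made to work for a two-field struct is sketched below. This is an illustration only: the `compute_capability` struct, its field names and the operators are hypothetical stand-ins written for this example, and the library's real definitions may differ.

```cpp
#include <cassert>

// Hypothetical stand-in for a two-field compute capability type.
struct compute_capability {
    unsigned major_version;
    unsigned minor_version;
};

// Builds a capability from its combined decimal form: 60 -> 6.0, 75 -> 7.5.
constexpr compute_capability make_compute_capability(unsigned combined)
{
    return compute_capability{ combined / 10, combined % 10 };
}

// Lexicographic ordering: compare major numbers first, then minor numbers.
constexpr bool operator<(compute_capability lhs, compute_capability rhs)
{
    return lhs.major_version < rhs.major_version
        || (lhs.major_version == rhs.major_version
            && lhs.minor_version < rhs.minor_version);
}

constexpr bool operator>=(compute_capability lhs, compute_capability rhs)
{
    return !(lhs < rhs);
}

// Usage: a check for "Pascal or later" (compute capability 6.0 and up).
constexpr bool is_pascal_or_later(compute_capability cc)
{
    return cc >= make_compute_capability(60);
}

static_assert(is_pascal_or_later(make_compute_capability(75)), "7.5 >= 6.0");
static_assert(!is_pascal_or_later(make_compute_capability(52)), "5.2 < 6.0");
```

Since everything here is `constexpr`, the comparison costs nothing at run time when its operands are known at compile time.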
Instead of using
cudaError_t cudaEventCreateWithFlags(
cudaEvent_t* event,
unsigned int flags)
which requires you remember what you need to specify as flags and how, you create a cuda::event_t proxy object, using the function:
cuda::event_t cuda::event::create(
cuda::device_t device,
bool uses_blocking_sync,
bool records_timing = cuda::event::do_record_timing,
bool interprocess = cuda::event::not_interprocess)
The default values here are enum : bool’s, which you can use yourself when creating non-default-parameter events - to make the call more easily readable than with true or false.
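The named-boolean idiom can be tried out in isolation. The sketch below defines its own constants in a `sketch` namespace; their names are illustrative guesses in the spirit of the defaults above, and `describe_event()` is a hypothetical stand-in which merely reports the requested settings rather than creating a CUDA event.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Unnamed enums with bool as the underlying type: each enumerator converts
// implicitly to bool, but reads far better than a bare true or false.
enum : bool { sync_by_busy_waiting = false, sync_by_blocking = true };
enum : bool { dont_record_timing   = false, do_record_timing = true };
enum : bool { not_interprocess     = false, is_interprocess  = true };

// Hypothetical stand-in for an event-creation function.
inline std::string describe_event(
    bool uses_blocking_sync,
    bool records_timing = do_record_timing,
    bool interprocess   = not_interprocess)
{
    return std::string(uses_blocking_sync ? "blocking" : "busy-waiting")
        + (records_timing ? ", timed" : ", untimed")
        + (interprocess ? ", interprocess" : ", process-local");
}

} // namespace sketch
```

Compare `sketch::describe_event(sketch::sync_by_blocking, sketch::dont_record_timing)` with `sketch::describe_event(true, false)`: both produce the same result, but only the first can be understood without looking up the parameter list.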
In lieu of a full-fledged user’s guide, I’m providing several kinds of example programs; browsing their source you’ll know most of what there is to know about the API wrappers. To build and run the examples (just as a sanity check), execute the following (in a Unix-style command shell):
cmake -S . -B build -DCAW_BUILD_EXAMPLES=ON
cmake --build build/
find build/examples/bin -type f -executable -exec "{}" ";"
The two main kinds of example programs are:
The CUDA distribution contains sample programs demonstrating various features and concepts. A few of these - which are not focused on device-side work - have been adapted to use the API wrappers, completely forgoing direct use of the CUDA Runtime API itself. You will find them in the modified CUDA samples example programs folder.
Gradually, an example program is being added for each one of the CUDA Runtime API Modules, demonstrating how use of that module’s API calls can be replaced by use of the API wrappers. These per-module example programs can be found here.