Jakub Homola

Reputation: 117

How to design a library that uses CUDA in only one of its parts, so that the other parts also work without CUDA installed?

Let's say we are developing a C++ library with several functions implementing operations on some data, e.g. SumArray, SquareElements, AddVectors. This is compiled into a C++ library and can be used from another program just fine.

Then we add a function MatrixMultiply. Because this is a perfect target for GPU acceleration, we also add a function MatrixMultiplyCuda, which internally calls some CUDA kernel.

So now the whole library requires CUDA, even if the user of the library never uses the MatrixMultiplyCuda function.

So, the question: is there a way to make the updated library functional even on a system without CUDA? Is there any library that deals with a similar problem? Obviously, the MatrixMultiplyCuda function would not work without CUDA, which is fine.

My current solution is to guard all the CUDA-specific code and functions with a macro MYLIB_USE_CUDA, so that they are compiled only when the macro is defined and excluded otherwise. I build the library using CMake, and if the flag -DMYLIB_USE_CUDA is passed to CMake, the macro is defined during compilation and the CUDA libraries are linked.
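
For illustration, the guarded header looks roughly like this (the signatures are shortened to placeholders):

// mylib.h
#pragma once

void SumArray(/* ... */);
void SquareElements(/* ... */);
void AddVectors(/* ... */);
void MatrixMultiply(/* ... */);

#ifdef MYLIB_USE_CUDA
// visible only when the library was built with CUDA support; the
// user's code must also define MYLIB_USE_CUDA before including this
void MatrixMultiplyCuda(/* ... */);
#endif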

However, I don't really like this solution, because if the library is used in other code, the macro MYLIB_USE_CUDA still has to be defined (because of the header files) whenever the CUDA-specific functions are to be used, which complicates the use of the library.

This issue is not specific to CUDA; it is the same with any other dependency, though with a small one it would not matter much. People don't want to install several gigabytes of CUDA because of my library if they are not even going to use the CUDA functionality.

Upvotes: 1

Views: 942

Answers (3)

Jakub Homola

Reputation: 117

So, after some thinking and discussion with colleagues, this is the way I solved it (posting a simplified version, the actual code is much more complicated).

mymath.h:

#pragma once
#include "MatrixMultiply.h"
// ...

#ifdef MYMATH_USE_CUDA
#include "MatrixMultiplyCUDA.h"
#endif

mymath_cuda.h:

#pragma once
#ifndef MYMATH_USE_CUDA
#define MYMATH_USE_CUDA
#endif

#include "mymath.h"

The MatrixMultiplyCUDA.h header contains the declarations of the functions that use CUDA for the operation, and MatrixMultiplyCUDA.cu contains their implementation. Similarly, MatrixMultiply.h declares the CPU-only version.
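
For example, MatrixMultiplyCUDA.h contains something along these lines (the Matrix type and exact signature are simplified placeholders, not the actual code):

// MatrixMultiplyCUDA.h
#pragma once
#include "Matrix.h" // whatever matrix type the library uses

// implemented in MatrixMultiplyCUDA.cu, compiled as CUDA code
Matrix MatrixMultiplyCUDA(const Matrix & A, const Matrix & B);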

If one wants to use the classic CPU-only part of the library, they #include "mymath.h". If one wants to use the additional GPU-accelerated functions which use CUDA, they #include "mymath_cuda.h".

Then, CMakeLists.txt:

cmake_minimum_required(VERSION 3.18)
project(mymath)

set(MYMATH_USE_CUDA OFF)
find_package(CUDA QUIET)
if(CUDA_FOUND)
    set(MYMATH_USE_CUDA ON)
    enable_language(CUDA)
endif()

set(mymath_SOURCES src/MatrixMultiply.cpp src/otherstuff.cpp)
set(mymath_cuda_SOURCES src/MatrixMultiplyCUDA.cu)

add_library(mymath STATIC ${mymath_SOURCES})
target_include_directories(mymath PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include)

if(MYMATH_USE_CUDA)
    add_library(mymath_cuda STATIC ${mymath_cuda_SOURCES})
    target_include_directories(mymath_cuda PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include)
    target_include_directories(mymath_cuda SYSTEM PUBLIC ${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES})
    target_link_libraries(mymath_cuda PUBLIC mymath ${CUDA_LIBRARIES})
    target_compile_definitions(mymath_cuda PUBLIC MYMATH_USE_CUDA)
endif()

If CUDA is found, the mymath_cuda library is added to the build together with all its dependencies, mymath included.

If the user only uses the mymath library, it is sufficient to link against it alone, but if the CUDA-accelerated functions are used, both libraries mymath and mymath_cuda need to be linked.
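
For example, a consumer's CMakeLists.txt could look roughly like this (assuming the library is pulled in via add_subdirectory; the target names are as above):

add_subdirectory(mymath)

add_executable(myapp src/main.cpp)
if(TARGET mymath_cuda)
    # links mymath transitively and defines MYMATH_USE_CUDA for the
    # consumer as well, since both are PUBLIC on mymath_cuda
    target_link_libraries(myapp PRIVATE mymath_cuda)
else()
    target_link_libraries(myapp PRIVATE mymath)
endif()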

Basically, it is another library extending the functionality of the first one, but with the header files interconnected. They would not have to be: mymath_cuda.h could include all the CUDA-specific headers itself and not rely on mymath.h and the #ifdefs, but this was the design choice we made.

Again, this is not the actual code, so it might contain typos or be partially incomplete/incorrect, but the basic principle is hopefully visible.

Upvotes: 1

einpoklum

Reputation: 132260

You can do this by moving all of the dynamic linking and symbol lookup work from link time to run time. Now, this is quite platform dependent, but on POSIX systems, the following works:

#include <stdlib.h>
#include <stdio.h>
#include <dlfcn.h>
#include <cuda_runtime.h>

int main() {
    typedef cudaError_t (*cudaMalloc_t) ( void** devPtr, size_t size );

    const char* cuda_rt_dll_filename = find_the_cuda_so_somehow();
        // e.g. on my system it's 
        // "/usr/local/cuda-11.4.1/targets/x86_64-linux/lib/libcudart.so" 
    void *cuda_rt_dll = dlopen(cuda_rt_dll_filename, RTLD_NOW);
    if (cuda_rt_dll == NULL) {
        fprintf(stderr, "Failed opening %s\n", cuda_rt_dll_filename);
        exit(EXIT_FAILURE);
    }
    // the cast is required when compiling as C++, since dlsym returns void*
    cudaMalloc_t cudaMalloc_ = (cudaMalloc_t) dlsym(cuda_rt_dll, "cudaMalloc");
    if (cudaMalloc_ == NULL) {
        fprintf(stderr, "Failed resolving cudaMalloc\n");
        exit(EXIT_FAILURE);
    }
    void* device_buffer;
    cudaError_t ret = cudaMalloc_(&device_buffer, 1000);

    // error-checking of ret here

    // Do stuff with the device buffer,
    // e.g. with more dynamically-loaded functions 

    dlclose(cuda_rt_dll);
}

Obviously you can do this in another function rather than in main(), and then the rest of your program can be completely CUDA-oblivious; and instead of exiting the program, you can return an error code indicating the failure.

Something else you could do is write stubs for each of the CUDA functions you use, visible only outside the object which does the dynamic loading, and with the same names as the original CUDA functions. The stub for a CUDA API function f would do the following (a sketch follows the list):

  • Check whether the program has already loaded f dynamically (into some f_ function pointer).
  • If not, try to load it.
  • If loading has failed, give up and return some CUDA failure code.
  • If loading has succeeded, invoke f_ with the arguments supplied to f and return the result.
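
A minimal sketch of such a stub, where resolve_cuda_symbol() is a hypothetical helper wrapping the dlopen()/dlsym() logic shown above and returning NULL on failure:

// cuda_stubs.cpp -- a stub with the same name as the real CUDA function
#include <cuda_runtime.h>

extern "C" void* resolve_cuda_symbol(const char* name); // hypothetical helper

extern "C" cudaError_t cudaMalloc(void** devPtr, size_t size) {
    typedef cudaError_t (*cudaMalloc_t)(void**, size_t);
    // resolved at most once, on the first call
    static cudaMalloc_t cudaMalloc_ =
        (cudaMalloc_t) resolve_cuda_symbol("cudaMalloc");
    if (cudaMalloc_ == NULL) {
        // loading failed: report a CUDA-style error instead of crashing
        return cudaErrorSharedObjectSymbolNotFound;
    }
    return cudaMalloc_(devPtr, size);
}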

PS:

  • Regarding run-time loading of shared objects/DLLs, see also: Get DLL path at runtime
  • Remember that CUDA's runtime API has some C++ functions as well. To load those, you need to consider name mangling (see here or on Wikipedia about that).

Upvotes: 0

pptaszni

Reputation: 8335

You don't need a macro for that. With CMake (or with whatever other tool you might be using) you can simply do

find_package(CUDA QUIET)  # detect if CUDA is installed on the system
if(CUDA_FOUND)
  # add MatrixMultiplyCuda.cpp to your list of sources
else()
  # add MatrixMultiplyNormal.cpp to your list of sources
endif()

and in your code use a common header (let's say MatrixMultiply.hpp) with matrix function signatures independent of CUDA (e.g. Matrix multiply(const Matrix& lhs, const Matrix& rhs);). Depending on which source file is compiled, your library will use one implementation or the other.
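
For example (Matrix being a placeholder for whatever matrix type the library uses):

// MatrixMultiply.hpp -- the common, CUDA-independent interface
#pragma once
#include "Matrix.h"

Matrix multiply(const Matrix& lhs, const Matrix& rhs);

// MatrixMultiplyNormal.cpp and MatrixMultiplyCuda.cpp each define this
// same multiply() function; only one of the two files is compiled into
// the library, selected by the if(CUDA_FOUND) branch above.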

Upvotes: 1
