CLDuser

In CMake, CUDA nvlink error: Undefined reference to '_Z15<foo>f' while trying to link to a __device__ void foo() {} function

Systems:

Windows 11, Visual Studio 17 2022, CUDA 12.5, GPU: RTX 3070, GPU driver: (need to update later)

Linux, Ubuntu 22.04.5 LTS, gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, CUDA 12.2, GPU: RTX A6000, GPU driver: 535.183.01

Docker env, host system: Linux, Ubuntu 20.04.6 LTS, gcc (Ubuntu 7.5.0-6ubuntu2) 7.5.0, CUDA 12.2, GPU: RTX A6000, GPU driver: 535.183.01

Initial Question:

Are device functions (__device__ void foo() {}) meant to be part of a public-facing API? That is, if I wanted to ship a .so/.dll file, can __device__ functions be part of that API, or is this not an intended use?

Many of my attempts at solving this revolve around decoupling translation units/files across different folders, which leads me to believe there may be other "structural" behaviors I am not aware of. I've tried many ways to resolve the error below, but I am a relatively new CUDA developer, so I wanted to get past this sanity check first.

If it is possible to have __device__ functions as part of my API, link to them throughout my project, and ship them as part of a .dll/.so file, then I would like some help with the following:

Context:

I am having trouble properly linking to a __device__ function from a CMake library. The issue originally stems from trying to test a __device__ function in the Google Test framework, but it has been simplified down to just trying to link to a __device__ function.

The code might not make a lot of sense due to the number of things I've tried. Originally the kernels were __device__ only, but I added the wrappers as a workaround.
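For context, the cross-translation-unit device linking I am trying to reproduce through CMake works when done by hand with nvcc. This is a sketch only (the file names match the project below, and I have not verified this exact invocation on both platforms):

```shell
# Compile each TU as relocatable device code (-dc) so nvlink can see the symbols.
nvcc -dc -arch=sm_80 -Xcompiler -fPIC src/my_cuda_code.cu -o my_cuda_code.o
nvcc -dc -arch=sm_80 -Xcompiler -fPIC apps/main.cu -o main.o

# Device-link step: nvlink resolves __device__ references across both objects.
nvcc -dlink -arch=sm_80 my_cuda_code.o main.o -o device_link.o

# Final host link.
g++ my_cuda_code.o main.o device_link.o -L/usr/local/cuda/lib64 -lcudart -o my_cuda_app
```

My understanding is that CMake's CUDA_SEPARABLE_COMPILATION is supposed to drive the same -dc/-dlink pipeline, which is why the nvlink failure below surprises me.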

Linux Error:

nvlink error   : Undefined reference to '_Z15device_functionf' in 'CMakeFiles/my_cuda_app.dir/main.cu.o' (target: sm_80)
gmake[2]: *** [apps/CMakeFiles/my_cuda_app.dir/build.make:98: apps/CMakeFiles/my_cuda_app.dir/cmake_device_link.o] Error 255
gmake[1]: *** [CMakeFiles/Makefile2:142: apps/CMakeFiles/my_cuda_app.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2

Windows Error:

CUDALINK : nvlink error : Undefined reference to '_Z15device_functionf' in 'my_cuda_app.dir/Debug/main.obj' [C:\Users\chang\source\repos\simple-cuda-dev\build\apps\my_cuda_app.vcxproj]
...
C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.5.targets(908,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12
.5\bin\nvcc.exe" -dlink  -o my_cuda_app.dir\Debug\my_cuda_app.device-link.obj -Xcompiler "/EHsc /W1 /nologo /Od /Zi /RTC1 /MDd /GR" -Xcompiler "/Fdmy_cuda_app.dir\Debug\vc143.pdb" -L"C:\Program Files\NVIDIA GPU
Computing Toolkit\CUDA\v12.5\bin/crt" -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\lib\x64" ..\src\Debug\my_cuda_lib.lib cudadevrt.lib cudart_static.lib kernel32.lib user32.lib gdi32.lib winspool.
lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib -forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -D_WINDOWS -Xcompiler=" /GR /EHsc" -Xcompiler=" -Zi -Ob0 -Od /RTC1" "--gene
rate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-MDd    my_cuda_app.dir\Debug\main.obj" exited with code 255. [C:\Users\chang\source\repos\simple-cuda-dev\build\apps\my_cuda_app.vcxproj]

Project Structure

simple-cuda-dev/
├── CMakeLists.txt          # Root CMake file
├── include/
│   └── my_cuda_defs.h      # Header file for device function declarations
├── src/
│   ├── CMakeLists.txt      # CMake file for building shared library
│   └── my_cuda_code.cu     # Source file containing device functions
├── apps/
│   ├── CMakeLists.txt      # CMake file for building application
│   └── main.cu             # Application source file
└── build/                  # Build directory (generated by CMake)

Code:

CMakeLists.txt

cmake_minimum_required(VERSION 3.18 FATAL_ERROR)

project(my_cuda_project LANGUAGES CXX CUDA)

# Set general compiler flags
if(CMAKE_BUILD_TYPE MATCHES Debug)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g") # Debug symbols
else()
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -DNDEBUG") # Optimization
endif()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra") # General warnings


# Set global CUDA architecture (applies to all targets)
set(CMAKE_CUDA_ARCHITECTURES 80)

# Set global C++ standard
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)

# Add subdirectories for libraries and executables
add_subdirectory(src)
add_subdirectory(apps)

src/CMakeLists.txt

# src/CMakeLists.txt
add_library(my_cuda_lib SHARED)  # Create the library (no source files yet)

target_sources(my_cuda_lib PRIVATE my_cuda_code.cu) # Explicitly add CUDA source

# Enable separable compilation for CUDA and ensure PIC for shared libraries
set_target_properties(my_cuda_lib PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
    POSITION_INDEPENDENT_CODE ON # Important for DLLs!
)

target_include_directories(my_cuda_lib PUBLIC ${PROJECT_SOURCE_DIR}/include)

apps/CMakeLists.txt

add_executable(my_cuda_app main.cu)

# Add include directories (private scope)
target_include_directories(my_cuda_app PRIVATE ${PROJECT_SOURCE_DIR}/include)

# Link against the library built in 'src'
target_link_libraries(my_cuda_app PRIVATE my_cuda_lib)

# Explicitly link CUDA libraries (note: ${CUDA_LIBRARIES} is only populated by
# the deprecated find_package(CUDA) module, so it is likely empty here)
target_link_libraries(my_cuda_app PRIVATE ${CUDA_LIBRARIES})

# Set CUDA_SEPARABLE_COMPILATION for the executable
set_target_properties(my_cuda_app PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
)

src/my_cuda_code.cu

#include "my_cuda_defs.h"

// Device function implementation
__device__ float device_function(float x) {
    return x * x; // Simple squaring function
}

// Kernel implementation
__global__ void my_kernel(float* input, float* output) {
    *output = device_function(*input);
}

// Host function implementation (exported for DLLs)
extern "C" CUDA_EXPORT void call_kernel(float input, float* output) {
    float *d_input, *d_output;

    // Allocate device memory
    cudaMalloc(&d_input, sizeof(float));
    cudaMalloc(&d_output, sizeof(float));

    // Copy input data to device
    cudaMemcpy(d_input, &input, sizeof(float), cudaMemcpyHostToDevice);

    // Launch kernel
    my_kernel<<<1, 1>>>(d_input, d_output);

    // Copy result back to host
    cudaMemcpy(output, d_output, sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_input);
    cudaFree(d_output);
}
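Side note: while debugging this I also wrapped the runtime calls above in an error check, since a silently failed launch can be mistaken for a linking problem. A minimal sketch (the macro name is my own):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print file/line and abort on any CUDA runtime error.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);     \
            std::abort();                                                   \
        }                                                                   \
    } while (0)

// Usage after a kernel launch:
//   my_kernel<<<1, 1>>>(d_input, d_output);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during execution
```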

include/my_cuda_defs.h

#ifndef MY_CUDA_DEFS_H
#define MY_CUDA_DEFS_H

#include <cuda_runtime.h>

// Define export macro for Windows
#ifdef _WIN32
    #define CUDA_EXPORT __declspec(dllexport)
#else
    #define CUDA_EXPORT
#endif

// Device function declaration
__device__ float device_function(float x);

// Host function declaration (exported for DLLs)
extern "C" CUDA_EXPORT void call_kernel(float input, float* output);

#endif // MY_CUDA_DEFS_H

Desired apps/main.cu: Here in main I assume that the __device__ function is part of the public API and try to reference it inside my own __global__ wrapper. The motivation for this is mainly testing inside the Google Test framework.

#include "my_cuda_defs.h"
#include <iostream>

#ifdef _WIN32
extern "C" void __stdcall call_kernel(float input, float* output);
#else
extern "C" void call_kernel(float input, float* output);
#endif

__global__ void wrapper_kernel(float* input, float* output, int size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < size) {
        output[idx] = device_function(input[idx]);
    }
}

void call_device_function(const float* h_input, float* h_output, int size) {
    float *d_input, *d_output;

    // Allocate device memory
    cudaMalloc(&d_input, size * sizeof(float));
    cudaMalloc(&d_output, size * sizeof(float));

    // Copy input data to device
    cudaMemcpy(d_input, h_input, size * sizeof(float), cudaMemcpyHostToDevice);

    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    wrapper_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_input, d_output, size);

    // Copy output data back to host
    cudaMemcpy(h_output, d_output, size * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_input);
    cudaFree(d_output);
}


int main() {
    const int size = 10;
    float h_input[size] = {1.0f, 2.0f, 3.0f, 10.0f}; // remaining elements zero-initialized
    float h_output[size];

    call_device_function(h_input, h_output, size);

    for (int i = 0; i < size; i++) {
        std::cout << "Output[" << i << "] = " << h_output[i] << std::endl;
    }

    return 0;
}

The current workaround, apps/main.cu:

#include "my_cuda_defs.h"
#include <iostream>

#ifdef _WIN32
extern "C" void __stdcall call_kernel(float input, float* output);
#else
extern "C" void call_kernel(float input, float* output);
#endif

int main() {
    float input = 2.0f;
    float output;

    call_kernel(input, &output);

    std::cout << "Result: " << output << std::endl;
    return 0;
}

Attempts at resolving the error: This is the closest SO question, but if you look at their code closely, they end up calling the kernel only from the host side. What I want instead is to reference the __device__ function directly, link to it properly, and create my own wrapper around it (as opposed to referencing a pre-made wrapper around the __device__ function).

CUDA CXX CMAKE: nvlink error : Undefined reference

As shown above, I've already tried setting these properties throughout my CMake files:

    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON

I've also tried turning on other properties like:

    POSITION_INDEPENDENT_CODE ON
    CUDA_LINK_EXECUTABLE_DEVICE_CODE ON

As mentioned, decorating the function with the __host__ attribute as well serves as a viable workaround, but it is not exactly the desired functionality. I want to be sure the code runs on the device, so decorating functions with both __host__ and __device__ is not ideal and is confusing.
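For completeness, the __host__ __device__ workaround I mentioned looks like this. It compiles the function for both targets, which is exactly why I find it confusing (sketch only):

```cuda
// Compiled for both host and device. With this qualifier pair the host-side
// copy satisfies the host linker, but it no longer expresses "device only",
// which is the guarantee I actually want.
__host__ __device__ float device_function(float x) {
    return x * x;
}
```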

Question:

The issue I'm having with code structure is that __global__ kernel functions must live at namespace scope and can't be hidden inside a class as member functions. So I have lots of wrapper functions around my kernels that I then encapsulate inside a class.

The __global__ kernel functions can be linked against fine across multiple translation units/files.

So at test time I do a functional test to ensure that the wrapper functions in the API work as intended, and I also do a lower-level test (think of an assert/low-level unit test/EXPECT_EQ()) to make sure that each kernel indexes and calculates as intended. However, __device__ functions now seem to exhibit behavior I am not aware of and appear to follow a different set of linking rules.
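Concretely, the device-level tests I have in mind look roughly like this (a sketch; it assumes Google Test is available and that device_function can be linked from another translation unit, which is exactly what currently fails):

```cuda
#include <gtest/gtest.h>
#include "my_cuda_defs.h"

// Hypothetical test wrapper: launch a one-thread kernel around the
// __device__ function and compare the result on the host.
__global__ void test_kernel(float in, float* out) {
    *out = device_function(in);
}

TEST(DeviceFunctionTest, SquaresInput) {
    float* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));

    test_kernel<<<1, 1>>>(3.0f, d_out);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    EXPECT_FLOAT_EQ(h_out, 9.0f);
}
```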

Also, I can't seem to find documentation on the linking behavior of __device__ functions. I thought it would be here:

14. C++ Language Support - Restrictions 14.5

How should I resolve the error above so that I can expose and link against the __device__ function?

