CLDuser

In CMake, CUDA nvlink error: Undefined reference to '_Z15<foo>f' while trying to link to a __device__ void foo() {} function

Systems:

Windows 11, Visual Studio 17 2022, CUDA 12.5, GPU: RTX 3070, GPU driver: (need to update later)

Linux, Ubuntu 22.04.5 LTS, gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, CUDA 12.2, GPU: RTX A6000, GPU driver: 535.183.01

Docker env, host system: Linux, Ubuntu 20.04.6 LTS, gcc (Ubuntu 7.5.0-6ubuntu2) 7.5.0, CUDA 12.2, GPU: RTX A6000, GPU driver: 535.183.01

Initial Question:

Are device functions (__device__ void foo() {}) meant to be part of a public-facing API? That is, if I wanted to ship a .so/.dll file, can __device__ functions be part of that API, or is this not an intended use?

Many of my attempts at solving this revolve around decoupling translation units/files across different folders, which leads me to believe there may be other "structural" behaviors I am not aware of. I've tried many ways to resolve the error below, but I am a relatively new CUDA developer, so I wanted to get past this sanity check first.

If it is possible to have __device__ functions as part of my API, link to them throughout my project, and ship them as part of a .dll/.so file, then I would like some help with the following:

Context:

I am having trouble properly linking to a __device__ function from a CMake library. The issue originally stems from trying to test a __device__ function in the Google Test framework, but it has been simplified down to just trying to link to a __device__ function.

The code might not make a lot of sense due to the number of things I've tried. Originally the kernels were __device__ only, but I added the wrappers as a workaround.
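For context, the cross-translation-unit device linking I am trying to reproduce through CMake works when done by hand with nvcc. This is a sketch only (the file names match the project below, and I have not verified this exact invocation on both platforms):

```shell
# Compile each TU as relocatable device code (-dc) so nvlink can see the symbols.
nvcc -dc -arch=sm_80 -Xcompiler -fPIC src/my_cuda_code.cu -o my_cuda_code.o
nvcc -dc -arch=sm_80 -Xcompiler -fPIC apps/main.cu -o main.o

# Device-link step: nvlink resolves __device__ references across both objects.
nvcc -dlink -arch=sm_80 my_cuda_code.o main.o -o device_link.o

# Final host link.
g++ my_cuda_code.o main.o device_link.o -L/usr/local/cuda/lib64 -lcudart -o my_cuda_app
```

My understanding is that CMake's CUDA_SEPARABLE_COMPILATION is supposed to drive the same -dc/-dlink pipeline, which is why the nvlink failure below surprises me.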

Linux Error:

nvlink error   : Undefined reference to '_Z15device_functionf' in 'CMakeFiles/my_cuda_app.dir/main.cu.o' (target: sm_80)
gmake[2]: *** [apps/CMakeFiles/my_cuda_app.dir/build.make:98: apps/CMakeFiles/my_cuda_app.dir/cmake_device_link.o] Error 255
gmake[1]: *** [CMakeFiles/Makefile2:142: apps/CMakeFiles/my_cuda_app.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2

Windows Error:

CUDALINK : nvlink error : Undefined reference to '_Z15device_functionf' in 'my_cuda_app.dir/Debug/main.obj' [C:\Users\chang\source\repos\simple-cuda-dev\build\apps\my_cuda_app.vcxproj]
...
C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.5.targets(908,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12
.5\bin\nvcc.exe" -dlink  -o my_cuda_app.dir\Debug\my_cuda_app.device-link.obj -Xcompiler "/EHsc /W1 /nologo /Od /Zi /RTC1 /MDd /GR" -Xcompiler "/Fdmy_cuda_app.dir\Debug\vc143.pdb" -L"C:\Program Files\NVIDIA GPU
Computing Toolkit\CUDA\v12.5\bin/crt" -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\lib\x64" ..\src\Debug\my_cuda_lib.lib cudadevrt.lib cudart_static.lib kernel32.lib user32.lib gdi32.lib winspool.
lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib -forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -D_WINDOWS -Xcompiler=" /GR /EHsc" -Xcompiler=" -Zi -Ob0 -Od /RTC1" "--gene
rate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-MDd    my_cuda_app.dir\Debug\main.obj" exited with code 255. [C:\Users\chang\source\repos\simple-cuda-dev\build\apps\my_cuda_app.vcxproj]

Project Structure

simple-cuda-dev/
├── CMakeLists.txt          # Root CMake file
├── include/
│   └── my_cuda_defs.h      # Header file for device function declarations
├── src/
│   ├── CMakeLists.txt      # CMake file for building shared library
│   └── my_cuda_code.cu     # Source file containing device functions
├── apps/
│   ├── CMakeLists.txt      # CMake file for building application
│   └── main.cu             # Application source file
└── build/                  # Build directory (generated by CMake)

Code:

CMakeLists.txt

cmake_minimum_required(VERSION 3.18 FATAL_ERROR)

project(my_cuda_project LANGUAGES CXX CUDA)

# Set general compiler flags
if(CMAKE_BUILD_TYPE MATCHES Debug)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g") # Debug symbols
else()
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -DNDEBUG") # Optimization
endif()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra") # General warnings


# Set global CUDA architecture (applies to all targets)
set(CMAKE_CUDA_ARCHITECTURES 80)

# Set global C++ standard
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)

# Add subdirectories for libraries and executables
add_subdirectory(src)
add_subdirectory(apps)

src/CMakeLists.txt

# src/CMakeLists.txt
add_library(my_cuda_lib SHARED)  # Create the library (no source files yet)

target_sources(my_cuda_lib PRIVATE my_cuda_code.cu) # Explicitly add CUDA source

# Enable separable compilation for CUDA and ensure PIC for shared libraries
set_target_properties(my_cuda_lib PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
    POSITION_INDEPENDENT_CODE ON # Important for DLLs!
)

target_include_directories(my_cuda_lib PUBLIC ${PROJECT_SOURCE_DIR}/include)

apps/CMakeLists.txt

add_executable(my_cuda_app main.cu)

# Add include directories (private scope)
target_include_directories(my_cuda_app PRIVATE ${PROJECT_SOURCE_DIR}/include)

# Link against the library built in 'src'
target_link_libraries(my_cuda_app PRIVATE my_cuda_lib)

# Explicitly link CUDA libraries (note: ${CUDA_LIBRARIES} is only populated by
# the deprecated find_package(CUDA) module, so it is likely empty here)
target_link_libraries(my_cuda_app PRIVATE ${CUDA_LIBRARIES})

# Set CUDA_SEPARABLE_COMPILATION for the executable
set_target_properties(my_cuda_app PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
)

src/my_cuda_code.cu

#include "my_cuda_defs.h"

// Device function implementation
__device__ float device_function(float x) {
    return x * x; // Simple squaring function
}

// Kernel implementation
__global__ void my_kernel(float* input, float* output) {
    *output = device_function(*input);
}

// Host function implementation (exported for DLLs)
extern "C" CUDA_EXPORT void call_kernel(float input, float* output) {
    float *d_input, *d_output;

    // Allocate device memory
    cudaMalloc(&d_input, sizeof(float));
    cudaMalloc(&d_output, sizeof(float));

    // Copy input data to device
    cudaMemcpy(d_input, &input, sizeof(float), cudaMemcpyHostToDevice);

    // Launch kernel
    my_kernel<<<1, 1>>>(d_input, d_output);

    // Copy result back to host
    cudaMemcpy(output, d_output, sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_input);
    cudaFree(d_output);
}
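Side note: while debugging this I also wrapped the runtime calls above in an error check, since a silently failed launch can be mistaken for a linking problem. A minimal sketch (the macro name is my own):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print file/line and abort on any CUDA runtime error.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);     \
            std::abort();                                                   \
        }                                                                   \
    } while (0)

// Usage after a kernel launch:
//   my_kernel<<<1, 1>>>(d_input, d_output);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during execution
```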

include/my_cuda_defs.h

#ifndef MY_CUDA_DEFS_H
#define MY_CUDA_DEFS_H

#include <cuda_runtime.h>

// Define export macro for Windows
#ifdef _WIN32
    #define CUDA_EXPORT __declspec(dllexport)
#else
    #define CUDA_EXPORT
#endif

// Device function declaration
__device__ float device_function(float x);

// Host function declaration (exported for DLLs)
extern "C" CUDA_EXPORT void call_kernel(float input, float* output);

#endif // MY_CUDA_DEFS_H

Desired apps/main.cu: Here in main I assume that the __device__ function is part of the public API and try to reference it inside my own __global__ wrapper. The motivation for this is mainly testing inside the Google Test framework.

#include "my_cuda_defs.h"
#include <iostream>

#ifdef _WIN32
extern "C" void __stdcall call_kernel(float input, float* output);
#else
extern "C" void call_kernel(float input, float* output);
#endif

__global__ void wrapper_kernel(float* input, float* output, int size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < size) {
        output[idx] = device_function(input[idx]);
    }
}

void call_device_function(const float* h_input, float* h_output, int size) {
    float *d_input, *d_output;

    // Allocate device memory
    cudaMalloc(&d_input, size * sizeof(float));
    cudaMalloc(&d_output, size * sizeof(float));

    // Copy input data to device
    cudaMemcpy(d_input, h_input, size * sizeof(float), cudaMemcpyHostToDevice);

    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    wrapper_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_input, d_output, size);

    // Copy output data back to host
    cudaMemcpy(h_output, d_output, size * sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_input);
    cudaFree(d_output);
}


int main() {
    const int size = 10;
    float h_input[size] = {1.0f, 2.0f, 3.0f, 10.0f}; // remaining elements zero-initialized
    float h_output[size];

    call_device_function(h_input, h_output, size);

    for (int i = 0; i < size; i++) {
        std::cout << "Output[" << i << "] = " << h_output[i] << std::endl;
    }

    return 0;
}

The current workaround, apps/main.cu:

#include "my_cuda_defs.h"
#include <iostream>

#ifdef _WIN32
extern "C" void __stdcall call_kernel(float input, float* output);
#else
extern "C" void call_kernel(float input, float* output);
#endif

int main() {
    float input = 2.0f;
    float output;

    call_kernel(input, &output);

    std::cout << "Result: " << output << std::endl;
    return 0;
}

Attempts at resolving the error: This is the closest SO question, but if you look at their code closely, they end up calling the kernel only from the host side. What I want instead is to reference the __device__ function directly, link to it properly, and create my own wrapper around it (as opposed to referencing a pre-made wrapper around the __device__ function).

CUDA CXX CMAKE: nvlink error : Undefined reference

As shown above, I've already tried setting these properties throughout my CMake files:

    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON

I've also tried turning on other properties like:

    POSITION_INDEPENDENT_CODE ON
    CUDA_LINK_EXECUTABLE_DEVICE_CODE ON

As mentioned, decorating the function with the __host__ attribute as well serves as a viable workaround, but it is not exactly the desired functionality. I want to be sure the code runs on the device, so decorating functions with both __host__ and __device__ is not ideal and is confusing.
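For completeness, the __host__ __device__ workaround I mentioned looks like this. It compiles the function for both targets, which is exactly why I find it confusing (sketch only):

```cuda
// Compiled for both host and device. With this qualifier pair the host-side
// copy satisfies the host linker, but it no longer expresses "device only",
// which is the guarantee I actually want.
__host__ __device__ float device_function(float x) {
    return x * x;
}
```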

Question:

The issue I'm having with code structure is that __global__ kernel functions must live at namespace scope and can't be hidden inside a class as member functions. So I have lots of wrapper functions around my kernels that I then encapsulate inside a class.

The __global__ kernel functions can be linked against fine across multiple translation units/files.

So at test time I do a functional test to ensure that the wrapper functions in the API work as intended, and I also do a lower-level test (think of an assert/low-level unit test/EXPECT_EQ()) to make sure that each kernel indexes and calculates as intended. However, __device__ functions now seem to exhibit behavior I am not aware of and appear to follow a different set of linking rules.
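Concretely, the device-level tests I have in mind look roughly like this (a sketch; it assumes Google Test is available and that device_function can be linked from another translation unit, which is exactly what currently fails):

```cuda
#include <gtest/gtest.h>
#include "my_cuda_defs.h"

// Hypothetical test wrapper: launch a one-thread kernel around the
// __device__ function and compare the result on the host.
__global__ void test_kernel(float in, float* out) {
    *out = device_function(in);
}

TEST(DeviceFunctionTest, SquaresInput) {
    float* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));

    test_kernel<<<1, 1>>>(3.0f, d_out);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    EXPECT_FLOAT_EQ(h_out, 9.0f);
}
```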

Also, I can't seem to find documentation on the linking behavior of __device__ functions. I thought it would be here:

14. C++ Language Support - Restrictions 14.5

How should I resolve the error above so that I can expose and link against the __device__ function?

