Systems:

- Windows 11, Visual Studio 17 2022, CUDA 12.5, GPU: RTX 3070, GPU driver: (need to update later)
- Linux Ubuntu 22.04.5 LTS, gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, CUDA 12.2, GPU: RTX A6000, GPU driver: 535.183.01
- Docker environment: Linux Ubuntu 20.04.6 LTS, gcc (Ubuntu 7.5.0-6ubuntu2) 7.5.0, CUDA 12.2, GPU: RTX A6000, GPU driver: 535.183.01
Initial Question:

Are `__device__` functions (e.g. `__device__ void foo() {}`) meant to be part of a public-facing API? That is, if I wanted to ship a .so/.dll file, can `__device__` functions be part of that API, or is this not an intended use?

Many of my attempts at solving this issue revolve around decoupling translation units/files of code in different folders, which leads me to believe there may be other "structural" behaviors I am not aware of. I've tried many methods to resolve the error below, but I am a relatively new CUDA developer, so I wanted to get past this sanity check first.

If it is possible to have `__device__` functions as part of my API, link to them throughout my project, and ship them as part of a .dll/.so file, then I would like some help with the following:
Context:

I am having trouble properly linking to a `__device__` function from a CMake library. The issue originally stems from trying to test a `__device__` function in a Google Test framework, but it has been reduced to simply trying to link to a `__device__` function.

The code may not make much sense given the number of things I've tried. Originally the kernels were `__device__`-only, but I added the host wrappers as a workaround.
Linux Error:

```
nvlink error : Undefined reference to '_Z15device_functionf' in 'CMakeFiles/my_cuda_app.dir/main.cu.o' (target: sm_80)
gmake[2]: *** [apps/CMakeFiles/my_cuda_app.dir/build.make:98: apps/CMakeFiles/my_cuda_app.dir/cmake_device_link.o] Error 255
gmake[1]: *** [CMakeFiles/Makefile2:142: apps/CMakeFiles/my_cuda_app.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2
```
Windows Error:

```
CUDALINK : nvlink error : Undefined reference to '_Z15device_functionf' in 'my_cuda_app.dir/Debug/main.obj' [C:\Users\chang\source\repos\simple-cuda-dev\build\apps\my_cuda_app.vcxproj]
...
C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.5.targets(908,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\nvcc.exe" -dlink -o my_cuda_app.dir\Debug\my_cuda_app.device-link.obj -Xcompiler "/EHsc /W1 /nologo /Od /Zi /RTC1 /MDd /GR" -Xcompiler "/Fdmy_cuda_app.dir\Debug\vc143.pdb" -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin/crt" -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\lib\x64" ..\src\Debug\my_cuda_lib.lib cudadevrt.lib cudart_static.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib -forward-unknown-to-host-compiler -Wno-deprecated-gpu-targets -D_WINDOWS -Xcompiler=" /GR /EHsc" -Xcompiler=" -Zi -Ob0 -Od /RTC1" "--generate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-MDd my_cuda_app.dir\Debug\main.obj" exited with code 255. [C:\Users\chang\source\repos\simple-cuda-dev\build\apps\my_cuda_app.vcxproj]
```
Project Structure:

```
simple-cuda-dev/
├── CMakeLists.txt        # Root CMake file
├── include/
│   └── my_cuda_defs.h    # Header file for device function declarations
├── src/
│   ├── CMakeLists.txt    # CMake file for building shared library
│   └── my_cuda_code.cu   # Source file containing device functions
├── apps/
│   ├── CMakeLists.txt    # CMake file for building application
│   └── main.cu           # Application source file
└── build/                # Build directory (generated by CMake)
```
Code:

CMakeLists.txt

```cmake
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(my_cuda_project LANGUAGES CXX CUDA)

# Set general compiler flags
if(CMAKE_BUILD_TYPE MATCHES Debug)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g")            # Debug symbols
else()
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -DNDEBUG")  # Optimization
endif()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra")     # General warnings

# Set global CUDA architecture (applies to all targets)
set(CMAKE_CUDA_ARCHITECTURES 80)

# Set global C++ standard
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED TRUE)

# Add subdirectories for libraries and executables
add_subdirectory(src)
add_subdirectory(apps)
```
src/CMakeLists.txt

```cmake
# src/CMakeLists.txt
add_library(my_cuda_lib SHARED)                      # Create the library (no source files yet)
target_sources(my_cuda_lib PRIVATE my_cuda_code.cu)  # Explicitly add CUDA source

# Enable separable compilation for CUDA and ensure PIC for shared libraries
set_target_properties(my_cuda_lib PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
    POSITION_INDEPENDENT_CODE ON  # Important for DLLs!
)

target_include_directories(my_cuda_lib PUBLIC ${PROJECT_SOURCE_DIR}/include)
```
apps/CMakeLists.txt

```cmake
add_executable(my_cuda_app main.cu)

# Add include directories (private scope)
target_include_directories(my_cuda_app PRIVATE ${PROJECT_SOURCE_DIR}/include)

# Link against the library built in 'src'
target_link_libraries(my_cuda_app PRIVATE my_cuda_lib)

# Explicitly link CUDA libraries
target_link_libraries(my_cuda_app PRIVATE ${CUDA_LIBRARIES})

# Set CUDA_SEPARABLE_COMPILATION for the executable
set_target_properties(my_cuda_app PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
)
```
src/my_cuda_code.cu

```cuda
#include "my_cuda_defs.h"

// Device function implementation
__device__ float device_function(float x) {
    return x * x;  // Simple squaring function
}

// Kernel implementation
__global__ void my_kernel(float* input, float* output) {
    *output = device_function(*input);
}

// Host function implementation (exported for DLLs)
extern "C" CUDA_EXPORT void call_kernel(float input, float* output) {
    float *d_input, *d_output;
    // Allocate device memory
    cudaMalloc(&d_input, sizeof(float));
    cudaMalloc(&d_output, sizeof(float));
    // Copy input data to device
    cudaMemcpy(d_input, &input, sizeof(float), cudaMemcpyHostToDevice);
    // Launch kernel
    my_kernel<<<1, 1>>>(d_input, d_output);
    // Copy result back to host
    cudaMemcpy(output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_input);
    cudaFree(d_output);
}
```
include/my_cuda_defs.h

```cuda
#ifndef MY_CUDA_DEFS_H
#define MY_CUDA_DEFS_H

#include <cuda_runtime.h>

// Define export macro for Windows
#ifdef _WIN32
#define CUDA_EXPORT __declspec(dllexport)
#else
#define CUDA_EXPORT
#endif

// Device function declaration
__device__ float device_function(float x);

// Host function declaration (exported for DLLs)
extern "C" CUDA_EXPORT void call_kernel(float input, float* output);

#endif // MY_CUDA_DEFS_H
```
Desired apps/main.cu

Here in main I assume that the `__device__` function is part of the public API and reference it inside my own `__global__` wrapper. The motivation for this is mainly testing inside a Google Test framework.
```cuda
#include "my_cuda_defs.h"
#include <iostream>

#ifdef _WIN32
extern "C" void __stdcall call_kernel(float input, float* output);
#else
extern "C" void call_kernel(float input, float* output);
#endif

__global__ void wrapper_kernel(float* input, float* output, int size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < size) {
        output[idx] = device_function(input[idx]);
    }
}

void call_device_function(const float* h_input, float* h_output, int size) {
    float *d_input, *d_output;
    // Allocate device memory
    cudaMalloc(&d_input, size * sizeof(float));
    cudaMalloc(&d_output, size * sizeof(float));
    // Copy input data to device
    cudaMemcpy(d_input, h_input, size * sizeof(float), cudaMemcpyHostToDevice);
    // Launch kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    wrapper_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_input, d_output, size);
    // Copy output data back to host
    cudaMemcpy(h_output, d_output, size * sizeof(float), cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_input);
    cudaFree(d_output);
}

int main() {
    const int size = 10;
    float h_input[size] = {1.0f, 2.0f, 3.0f, 10.0f};  // remaining entries zero-initialized
    float h_output[size];
    call_device_function(h_input, h_output, size);
    for (int i = 0; i < size; i++) {
        std::cout << "Output[" << i << "] = " << h_output[i] << std::endl;
    }
    return 0;
}
```
The current workaround apps/main.cu:

```cuda
#include "my_cuda_defs.h"
#include <iostream>

#ifdef _WIN32
extern "C" void __stdcall call_kernel(float input, float* output);
#else
extern "C" void call_kernel(float input, float* output);
#endif

int main() {
    float input = 2.0f;
    float output;
    call_kernel(input, &output);
    std::cout << "Result: " << output << std::endl;
    return 0;
}
```
Attempts at resolving the error:

This is the closest SO question, but if you look closely at their code, they end up invoking the kernel from the host side, as opposed to what I want: referencing the `__device__` function, linking to it properly, and creating a wrapper for it (rather than referencing a wrapper around the `__device__` function).

CUDA CXX CMAKE: nvlink error : Undefined reference

As shown above, I've already tried adding these target properties throughout my CMake files:

```
CUDA_SEPARABLE_COMPILATION ON
CUDA_RESOLVE_DEVICE_SYMBOLS ON
```
I've also tried turning on other properties like:

```
POSITION_INDEPENDENT_CODE ON
CUDA_LINK_EXECUTABLE_DEVICE_CODE ON
```
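For completeness, the pattern I've been experimenting with looks like the sketch below. I am not certain which targets these properties actually belong on when device code crosses a library boundary; that uncertainty is part of my question.

```cmake
# Sketch of the per-target properties I've been toggling.
# NOTE: I'm unsure whether CUDA_RESOLVE_DEVICE_SYMBOLS should be set on
# the library, the executable, or both when a __device__ function is
# defined in one target and referenced from another.
set_target_properties(my_cuda_lib my_cuda_app PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON   # compile with relocatable device code (-rdc=true)
    CUDA_RESOLVE_DEVICE_SYMBOLS ON  # run the device link step (-dlink) for the target
    POSITION_INDEPENDENT_CODE ON    # needed for shared libraries
)
```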
As mentioned, decorating functions with `__host__` as well also serves as a viable workaround, but it is not exactly the desired functionality. I want to be sure the code will run on the device, so decorating functions with both `__host__` and `__device__` is not ideal and is confusing.
Question:

The issue I'm having with code structure is that `__global__` kernel functions are scoped globally and can't be hidden inside a class, so I have lots of wrapper functions around my kernels that I then encapsulate inside a class. The `__global__` kernel functions link fine across multiple translation units/files.

At test time, I do a functional test to ensure that the wrapper functions in the API work as intended, and I also do a lower-level test (think of an assert/low-level unit test/`EXPECT_EQ()`) to make sure each kernel indexes and calculates as intended. However, `__device__` functions seem to follow a different set of rules, and there appears to be some usage/behavior I am not aware of.
Also, I can't seem to find documentation on the linking behavior of `__device__` functions. I thought it would be here:

14. C++ Language Support - Restrictions 14.5
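For context, here is my current understanding of what the separate-compilation tool flow does under the hood; the commands are an illustrative sketch of the steps CMake drives, not copied from my actual build:

```
# Compile each .cu with relocatable device code
nvcc -rdc=true -c src/my_cuda_code.cu -o my_cuda_code.o
nvcc -rdc=true -c apps/main.cu        -o main.o

# Device link step: this is where nvlink resolves __device__ references
nvcc -dlink my_cuda_code.o main.o -o device_link.o

# Final host link
g++ my_cuda_code.o main.o device_link.o -lcudadevrt -lcudart -o my_cuda_app
```

My failure happens at the `-dlink` stage, which (if I understand correctly) suggests the device link for my executable cannot see the device code that was already linked into the shared library.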
How should I resolve the error above to expose and link against the `__device__` function?