Reputation: 131554

What exact constraints must kernel parameter types satisfy?

When writing CUDA kernels, common wisdom is to stick to trivially-copyable types for parameters - as launching a kernel means copying arguments to constant memory on the device (and possibly later into thread-specific registers).

But is that the necessary and sufficient condition for a kernel (__global__ function) parameter? I couldn't find this formalized in the CUDA Programming Guide, nor in the C++ Best Practices Guide.

Upvotes: 1

Answers (1)

Abator Abetor

Reputation: 2598

Update: In the CUDA programming guide v11.4.1, the following restrictions are mentioned in section I.4.9.3.1 https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-function-argument-processing

When a __global__ function is launched from host code, each argument type is allowed to be non-trivially copyable or non-trivially-destructible, but the processing for such types does not follow the standard C++ model, as described below. User code must ensure that this workflow does not affect program correctness. The workflow diverges from standard C++ in two areas:

Memcpy instead of copy constructor invocation When lowering a __global__ function launch from host code, the compiler generates stub functions that copy the parameters one or more times by value, before eventually using memcpy to copy the arguments to the __global__ function's parameter memory on the device. This occurs even if an argument was non-trivially-copyable, and therefore may break programs where the copy constructor has side effects.

Destructor may be invoked before the __global__ function has finished Kernel launches are asynchronous with host execution. As a result, if a __global__ function argument has a non-trivial destructor, the destructor may execute in host code even before the __global__ function has finished execution. This may break programs where the destructor has side effects.

My old answer:

I do not have a definitive answer. This answer is based on my own observations.

cudaLaunch*Kernel functions take untyped void* pointers to kernel arguments from which the arguments are copied byte-wise. The number of bytes is determined automatically. Quoting the driver API documentation of cudaLauchKernel:

Each of kernelParams[0] through kernelParams[N-1] must point to a region of memory from which the actual kernel parameter will be copied. The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image.

Taking this into account I would agree that kernel arguments should be trivially copyable.

However, using <<< >>> notation gives the impression that non-trivially copyable types are possible, too, because an implicit copy-construction is performed in the code generated by nvcc prior to the launch.

Consider the following code which can be compiled via nvcc -arch=sm_61 -O3 main.cu -o main:

#include <cstdio>
#include <cassert>
#include <type_traits>

struct Foo{
    bool iscopy = false;
    
    __host__ __device__
    Foo(){
    }
    
    __host__ __device__
    ~Foo(){
        printf("~Foo() iscopy %d\n", iscopy);
    }
    
    __host__ __device__
    Foo(const Foo& rhs) : iscopy(true) {printf("Foo(const Foo&)\n");}
    
    __host__ __device__
    Foo& operator=(const Foo& rhs){ printf("operator=(const Foo&\n"); iscopy = true; return *this; }    
    
    //device only
    __device__
    int access(int i) const{
        printf("access %d, iscopy = %d\n", i, iscopy);
        return 3;
    }       
};


static_assert(std::is_trivially_copyable<Foo>::value == false);

__global__
void kernel(Foo f){
    f.access(13);
}

__global__
void kernel2(Foo f){
    f.access(42);
}

int main(){
    Foo hostfoo;
    
    kernel<<<1,1>>>(hostfoo);
    cudaError_t status = cudaDeviceSynchronize();
    assert(status == cudaSuccess);
    
    
    void* args[1] = {&hostfoo};
    
    status = cudaLaunchKernel((void*)&kernel2, dim3(1), dim3(1), &args[0], size_t(0));
    assert(status == cudaSuccess);
    
    status = cudaDeviceSynchronize();
    assert(status == cudaSuccess);
    
    printf("kernels finished\n");
    
    return 0;
}

The output is

Foo(const Foo&)
~Foo() iscopy 1
access 13, iscopy = 1
access 42, iscopy = 0
kernels finished
~Foo() iscopy 0

When using <<< >>> notation to launch the kernel the copy-constructor is called from the implicit wrapper functions for kernel launch generated by nvcc. iscopy = 1

However, this is not the case with cudaLaunchKernel. Note iscopy = 0 in the output.

Upvotes: 2

What exact constraints must kernel parameter types satisfy?

Answers (1)

Related Questions