ztdep

Reputation: 381

cuda::cub error calling a __host__ function from a __device__ function is not allowed

I use cub::DeviceReduce::Sum to compute the summation of a vector, but it gives me the following errors:

error: calling a __host__ function("cub::DeviceReduce::Sum<double *, double *> ") from a __device__ function("dotcubdev") is not allowed
error: identifier "cub::DeviceReduce::Sum<double *, double *> " is undefined in device code

The code sample is as follows:

__device__ void sumcubdev(double* a, double *sum, int N)
{
    // Declare, allocate, and initialize device-accessible pointers
    // for input and output
    // Determine temporary device storage requirements
    void   *d_temp_storage = NULL;
    size_t  temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
    // Allocate temporary storage
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Run sum-reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
}

The code runs successfully when placed in the main() body, but it doesn't work inside this __device__ function.
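For reference, the host-side pattern that works from main() would be something like the following minimal sketch (my actual main() is not shown above, so the names d_in, d_out, and N here are only illustrative):

#include <cub/cub.cuh>

int main()
{
    const int N = 1024;
    double *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(double));
    cudaMalloc(&d_out, sizeof(double));
    // ... fill d_in with data ...

    // First call: query how much temporary storage DeviceReduce::Sum needs.
    void   *d_temp_storage = NULL;
    size_t  temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, N);

    // Allocate the temporary storage, then run the actual reduction.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, N);

    cudaFree(d_temp_storage);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}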

Upvotes: 2

Views: 636

Answers (1)

Robert Crovella

Reputation: 152164

To use a cub device-wide function from device code, it is necessary to build your project to support CUDA dynamic parallelism. In the cub documentation, this is indicated here:

Usage Considerations

Dynamic parallelism. DeviceReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

For example, you can compile the code you have shown with:

$ cat t1364.cu
#include <cub/cub.cuh>
__device__ void sumcubdev(double* a, double *sum, int N)
{
    // Declare, allocate, and initialize device-accessible pointers
    // for input and output
    // Determine temporary device storage requirements
    void   *d_temp_storage = NULL;
    size_t  temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
    // Allocate temporary storage
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Run sum-reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
}
$ nvcc -arch=sm_35 -dc t1364.cu
$

(CUDA 9.2, CUB 1.8.0)

This means CUB will be launching child kernels to get the work done.
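To make that concrete, a hypothetical wrapper kernel (not part of the question's code) that invokes the device function might look like this; CUB's child kernels are then launched from within this kernel's execution:

__global__ void caller(double *a, double *sum, int N)
{
    // Let a single thread issue the device-side reduction; CUB launches
    // child kernels via dynamic parallelism to do the actual work.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        sumcubdev(a, sum, N);
}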

This is not a complete tutorial on how to use CUDA Dynamic Parallelism (CDP). The above shows only the compile command and omits the link step (a rough sketch of that step follows below). There are many questions under the cuda tag that discuss CDP; you can read about it in two blog articles and in the programming guide, and there are CUDA sample projects showing how to compile and use it.
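As a rough sketch of the missing link step (assuming a separate main.cu that defines a wrapper kernel like the one above plus a main(); exact flags can vary with the CUDA version), separate compilation with relocatable device code and the device runtime library would look something like:

$ nvcc -arch=sm_35 -dc main.cu
$ nvcc -arch=sm_35 t1364.o main.o -o app -lcudadevrt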

Upvotes: 3
