Reputation: 381
I am using cub::DeviceReduce::Sum to compute the sum of a vector, but it gives me these errors:
error: calling a __host__ function("cub::DeviceReduce::Sum<double *, double *> ") from a __device__ function("dotcubdev") is not allowed
error: identifier "cub::DeviceReduce::Sum<double *, double *> " is undefined in device code
The code sample is as follows:
__device__ void sumcubdev(double* a, double *sum, int N)
{
    // Declare, allocate, and initialize device-accessible pointers
    // for input and output
    // Determine temporary device storage requirements
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
    // Allocate temporary storage
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Run sum-reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
}
The same code runs successfully when placed in the body of main(), but it fails to compile inside the __device__ function.
Upvotes: 2
Views: 636
Reputation: 152164
To use a CUB device-wide function from device code, you must build your project with support for CUDA dynamic parallelism. The CUB documentation indicates this here:
Usage Considerations: Dynamic parallelism. DeviceReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
For example, you can compile the code you have shown with:
$ cat t1364.cu
#include <cub/cub.cuh>
__device__ void sumcubdev(double* a, double *sum, int N)
{
    // Declare, allocate, and initialize device-accessible pointers
    // for input and output
    // Determine temporary device storage requirements
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
    // Allocate temporary storage
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Run sum-reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
}
$ nvcc -arch=sm_35 -dc t1364.cu
$
(CUDA 9.2, CUB 1.8.0)
This means CUB will be launching child kernels to get the work done.
This is not a complete tutorial on how to use CUDA Dynamic Parallelism (CDP). The above is the compile command only and omits the link step. Many questions under the cuda tag discuss CDP; you can also read about it in two blog articles and the programming guide, and there are CUDA sample projects showing how to compile and use it.
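For orientation, here is a minimal sketch of what a complete program, including the link step, might look like. It is not from the question: the file name cdp_sum.cu, the kernel sum_kernel, the helper device_side_sum, the CHECK macro, and the -arch value in the build line are all hypothetical. It also deviates from the code above by having the host do the size query and allocate the temporary storage, on the assumption that the host-side query reports the size the device-side call needs. That avoids freeing memory from device code while child kernels may still be running. The status returned by the device-side call is copied back and checked, so a mismatch should surface as an error rather than a wrong result.

```cuda
// Assumed build line (compile and link in one step; pick -arch for your GPU):
//   nvcc -arch=sm_70 -rdc=true -o cdp_sum cdp_sum.cu -lcudadevrt
#include <cub/cub.cuh>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical helper macro: abort on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                         cudaGetErrorString(err_));                     \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical kernel: a single thread issues the device-wide reduction,
// which CUB carries out by launching child kernels (dynamic parallelism).
// Temporary storage is supplied by the host (an assumption of this sketch).
__global__ void sum_kernel(void* d_temp, size_t temp_bytes,
                           const double* d_in, double* d_out, int n,
                           cudaError_t* d_status)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *d_status = cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    }
}

// Hypothetical host wrapper: copies the input to the GPU, runs the kernel,
// and returns the sum once the parent kernel and its children have finished.
double device_side_sum(const std::vector<double>& h_in)
{
    const int n = static_cast<int>(h_in.size());
    double* d_in = nullptr;
    double* d_out = nullptr;
    cudaError_t* d_status = nullptr;
    CHECK(cudaMalloc(&d_in, n * sizeof(double)));
    CHECK(cudaMalloc(&d_out, sizeof(double)));
    CHECK(cudaMalloc(&d_status, sizeof(cudaError_t)));
    CHECK(cudaMemcpy(d_in, h_in.data(), n * sizeof(double),
                     cudaMemcpyHostToDevice));

    // Size query on the host: with a null temp pointer, Sum only reports
    // the number of bytes it needs.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    CHECK(cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n));
    CHECK(cudaMalloc(&d_temp, temp_bytes));

    sum_kernel<<<1, 1>>>(d_temp, temp_bytes, d_in, d_out, n, d_status);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // parent completes only after its children

    cudaError_t h_status = cudaSuccess;
    double h_out = 0.0;
    CHECK(cudaMemcpy(&h_status, d_status, sizeof(h_status),
                     cudaMemcpyDeviceToHost));
    CHECK(h_status);  // status of the device-side CUB call
    CHECK(cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost));

    CHECK(cudaFree(d_temp));
    CHECK(cudaFree(d_status));
    CHECK(cudaFree(d_out));
    CHECK(cudaFree(d_in));
    return h_out;
}

int main()
{
    std::vector<double> h_in(1024, 1.0);  // 1024 ones as sample input
    std::printf("sum = %f\n", device_side_sum(h_in));
    return 0;
}
```

The important parts of the build line are -rdc=true (relocatable device code, which -dc also implies when compiling only) and linking against cudadevrt, the device runtime library. The result is read on the host after the kernel finishes rather than inside the kernel, because the child kernels run asynchronously with respect to the thread that launched them.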
Upvotes: 3