Reputation: 137
I have a __host__ __device__ function which is a wrapper that calls into the "sort" function of the thrust library. Inside this wrapper, I am using the __CUDA_ARCH__ macro to set the execution policy to thrust::device when called from the host and thrust::seq when called from the device. The following piece of code generates a runtime error:
#ifndef __CUDA_ARCH__
thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
#else
thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
#endif
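For context, this fragment sits inside a wrapper of roughly the following shape (the wrapper name my_sort_wrapper here is a placeholder for my actual function):

template <typename T>
__host__ __device__
void my_sort_wrapper(T *data, size_t num){
#ifndef __CUDA_ARCH__
  // host compilation pass: sort in parallel on the device
  thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
#else
  // device compilation pass: sort sequentially in the calling thread
  thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
#endif
}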
The error is:
Unexpected Standard exception: What() is:merge_sort: failed on 2nd step: invalid device function
As per my understanding, __CUDA_ARCH__ can be used for conditional compilation. I would appreciate help in understanding why this error is thrown.
Upvotes: 1
Views: 547
Reputation: 486
Unfortunately, we can't fix this in Thrust. The trouble here is that the NVCC compiler needs to see all __global__ function template instantiations during host compilation (i.e. when __CUDA_ARCH__ is not defined); otherwise the kernels will be treated as unused and discarded. See this CUB GitHub issue for more details.
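As a toy sketch of the mechanism (hypothetical names, not Thrust's actual internals), consider a kernel template that is launched only under the host-side branch of a __CUDA_ARCH__ check:

// Hypothetical example: the kernel template is instantiated only in the
// host pass, so the device pass generates no device code for it.
template <typename T>
__global__ void toy_kernel(T *p){ *p = T(1); }

template <typename T>
__host__ __device__
void toy_wrapper(T *p){
#ifndef __CUDA_ARCH__
  // Host pass: instantiates toy_kernel<T> and emits the launch stub...
  toy_kernel<T><<<1,1>>>(p);
#else
  // ...but the device pass never sees that instantiation, so the kernel
  // is discarded and the launch later fails with "invalid device function".
  *p = T(1);
#endif
}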
As Robert suggested, a workaround such as this should be fine. Because the dispatch is an ordinary runtime if on a compile-time constant, rather than a preprocessor #if, both thrust calls are visible in both compilation passes, so every kernel they require gets instantiated:
#include <iostream>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

template <typename T>
struct customGreater {
  __host__ __device__ bool operator()(const T &t1, const T &t2){
    return t1 > t2;
  }
};

#if defined(__CUDA_ARCH__)
#define DEVICE_COMPILATION 1
#else
#define DEVICE_COMPILATION 0
#endif

template <typename T>
__host__ __device__
void my_sort(T *data, size_t num){
  if (DEVICE_COMPILATION) // device pass: sequential sort in the calling thread
    thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
  else                    // host pass: parallel sort on the device
    thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
}

template <typename T>
__global__ void my_dev_sort(T *data, size_t num){
  my_sort(data, num);
}

typedef int mytype;
const size_t sz = 10;

int main(){
  mytype *d_data;
  cudaMallocManaged(&d_data, sz*sizeof(mytype));
  cudaMemset(d_data, 0, sz*sizeof(mytype));
  my_sort(d_data, sz);              // sort from host code
  my_dev_sort<<<1,1>>>(d_data, sz); // sort from device code
  cudaDeviceSynchronize();          // wait for the kernel before freeing
  cudaFree(d_data);
}
Upvotes: 3
Reputation: 151869
It seems you are stepping on this issue. In a nutshell, thrust uses CUB functionality under the hood for certain algorithms (including sort). Your use of the __CUDA_ARCH__ macro in code that wraps these thrust algorithm calls interferes with CUB code that expects to be able to use this macro for all compilation paths.
A possible workaround is to do "your own dispatch":
$ cat t142.cu
#include <iostream>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

template <typename T>
struct customGreater {
  __host__ __device__ bool operator()(const T &t1, const T &t2){
    return t1 > t2;
  }
};

template <typename T>
__host__ __device__
void my_sort_wrapper(T *data, size_t num){
  int hostdev = 0;  // 0 = device code
#ifndef __CUDA_ARCH__
  hostdev = 1;      // 1 = host code
#endif
  if (hostdev == 0)
    thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
  else
    thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
}

template <typename T>
__global__ void my_dev_sort(T *data, size_t num){
  my_sort_wrapper(data, num);
}

typedef int mytype;
const size_t sz = 10;

int main(){
  mytype *d_data;
  cudaMalloc(&d_data, sz*sizeof(mytype));
  cudaMemset(d_data, 0, sz*sizeof(mytype));
  my_sort_wrapper(d_data, sz);      // sort from host code
  my_dev_sort<<<1,1>>>(d_data, sz); // sort from device code
  cudaDeviceSynchronize();
}
$ nvcc t142.cu -o t142
$ cuda-memcheck ./t142
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
With this realization, the __CUDA_ARCH__ macro only selects the value of a runtime flag, so it does not perturb the compilation of the thrust algorithms: both thrust calls are compiled in both passes, and the unused branch is simply never taken at run time.
Another possible workaround is simply to use the thrust::device policy for both cases (no dispatch - just the thrust algorithm call). Except in the case of CUDA Dynamic Parallelism, thrust::device will "decay" to thrust::seq when used in device code.
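A minimal sketch of that simplification, reusing the my_sort_wrapper shape from above:

template <typename T>
__host__ __device__
void my_sort_wrapper(T *data, size_t num){
  // One call serves both sides: from host code this sorts in parallel on
  // the device; compiled for and called from device code (without CUDA
  // Dynamic Parallelism), thrust::device decays to a sequential sort.
  thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
}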
I would expect that these suggestions would only be necessary/relevant when the thrust algorithm uses CUB functionality in the underlying implementation.
If you don't like this behavior, you could file a thrust issue.
Upvotes: 3