Vandana

Reputation: 137

__CUDA_ARCH__ flag with Thrust execution policy

I have a __host__ __device__ function that is a wrapper around a call to the sort function of the Thrust library. Inside this wrapper, I use the __CUDA_ARCH__ macro to set the execution policy to thrust::device when called from the host and thrust::seq when called from the device. The following piece of code generates a runtime error:

#ifndef __CUDA_ARCH__
    thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
#else
    thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
#endif

The error is:

Unexpected Standard exception: What() is:merge_sort: failed on 2nd step: invalid device function

As per my understanding, __CUDA_ARCH__ can be used for conditional compilation. I would appreciate help in understanding why this error is thrown.

Upvotes: 1

Views: 547

Answers (2)

blelbach

Reputation: 486

Unfortunately, we can't fix this in Thrust. The trouble here is that the NVCC compiler needs to see all __global__ function template instantiations during host compilation (i.e., when __CUDA_ARCH__ is not defined); otherwise the kernels will be treated as unused and discarded. See this CUB GitHub issue for more details.

As Robert suggested, a workaround such as this should be fine:

#include <iostream>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

template <typename T>
struct customGreater {
  __host__ __device__ bool operator()(T &t1, T &t2){
    return (t1 > t2);
  }
};

#if defined(__CUDA_ARCH__)
  #define DEVICE_COMPILATION 1
#else
  #define DEVICE_COMPILATION 0
#endif

template <typename T>
__host__ __device__
void my_sort(T *data, size_t num){
  // Both branches are compiled in both passes; only the runtime test
  // depends on __CUDA_ARCH__, so all kernel instantiations stay visible.
  if (DEVICE_COMPILATION)
    thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
  else
    thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
}

template <typename T>
__global__ void my_dev_sort(T *data, size_t num){
  my_sort(data, num);
}
typedef int mytype;
const size_t sz = 10;
int main(){
  mytype *d_data;
  cudaMallocManaged(&d_data, sz*sizeof(mytype));
  cudaMemset(d_data, 0, sz*sizeof(mytype));
  my_sort(d_data, sz);
  my_dev_sort<<<1,1>>>(d_data, sz);
  cudaDeviceSynchronize();
  cudaFree(d_data);
}

Upvotes: 3

Robert Crovella
Robert Crovella

Reputation: 151869

It seems you are stepping on this issue. In a nutshell, Thrust uses CUB functionality under the hood for certain algorithms (including sort). Your use of the __CUDA_ARCH__ macro in your code, which wraps around Thrust algorithm calls that use CUB, interferes with CUB code that expects to be able to use this macro for all paths.

A possible workaround is to do "your own dispatch":

$ cat t142.cu
#include <iostream>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>


template <typename T>
struct customGreater {
  __host__ __device__ bool operator()(T &t1, T &t2){
    return (t1 > t2);
  }
};

template <typename T>
__host__ __device__
void my_sort_wrapper(T *data, size_t num){
    int hostdev = 0;  // 0=device code
#ifndef __CUDA_ARCH__
    hostdev = 1;  // 1=host code
#endif
    if (hostdev == 0) thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
    else thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());

}

template <typename T>
__global__ void my_dev_sort(T *data, size_t num){
  my_sort_wrapper(data, num);
}
typedef int mytype;
const size_t sz = 10;
int main(){
  mytype *d_data;
  cudaMalloc(&d_data, sz*sizeof(mytype));
  cudaMemset(d_data, 0, sz*sizeof(mytype));
  my_sort_wrapper(d_data, sz);
  my_dev_sort<<<1,1>>>(d_data, sz);
  cudaDeviceSynchronize();
}
$ nvcc t142.cu -o t142
$ cuda-memcheck ./t142
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

Arranged this way, the use of the __CUDA_ARCH__ macro does not perturb the compilation of the Thrust algorithms.

Another possible workaround is simply to use thrust::device policy for both cases (no dispatch - just the thrust algorithm call). Except in the case of CUDA Dynamic Parallelism, thrust::device will "decay" to thrust::seq when used in device code.
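For illustration, here is a minimal sketch of that no-dispatch variant (the wrapper name my_sort_nodispatch is hypothetical; the comparator and setup mirror the examples above):

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

template <typename T>
struct customGreater {
  __host__ __device__ bool operator()(T &t1, T &t2){
    return (t1 > t2);
  }
};

// Hypothetical wrapper: one policy, no __CUDA_ARCH__ dispatch. When this
// runs in device code (outside CUDA Dynamic Parallelism), thrust::device
// decays to sequential in-thread execution, as noted above.
template <typename T>
__host__ __device__
void my_sort_nodispatch(T *data, size_t num){
  thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
}

Because both compilation passes see the identical thrust::device call, the kernel template instantiations are generated consistently and no __CUDA_ARCH__ guard is needed.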

I would expect that these suggestions would only be necessary/relevant when the thrust algorithm uses CUB functionality in the underlying implementation.

If you don't like this behavior, you could file a thrust issue.

Upvotes: 3
