Vandana

Reputation: 137

__CUDA_ARCH__ flag with Thrust execution policy

I have a __host__ __device__ function that is a wrapper around a call to the sort function of the Thrust library. Inside this wrapper, I use the __CUDA_ARCH__ macro to set the execution policy to thrust::device when called from the host and thrust::seq when called from the device. The following piece of code generates a runtime error:

#ifndef __CUDA_ARCH__
    thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
#else
    thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
#endif

The error is:

Unexpected Standard exception: What() is:merge_sort: failed on 2nd step: invalid device function

As per my understanding, __CUDA_ARCH__ can be used for conditional compilation. I would appreciate help in understanding why this error is thrown.

Upvotes: 1

Views: 547

Answers (2)

blelbach

Reputation: 486

Unfortunately, we can't fix this in Thrust. The trouble here is that the NVCC compiler needs to see all __global__ function template instantiations during host compilation (i.e., when __CUDA_ARCH__ is not defined); otherwise the kernels will be treated as unused and discarded. See this CUB GitHub issue for more details.

As Robert suggested, a workaround such as this should be fine:

#include <iostream>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

template <typename T>
struct customGreater {
  __host__ __device__ bool operator()(T &t1, T &t2){
    return (t1 > t2);
  }
};

#if defined(__CUDA_ARCH__)
  #define DEVICE_COMPILATION 1
#else
  #define DEVICE_COMPILATION 0
#endif

template <typename T>
__host__ __device__
void my_sort(T *data, size_t num){
  // Both branches are compiled in both passes; only the runtime test
  // depends on __CUDA_ARCH__, so all kernel instantiations stay visible.
  if (DEVICE_COMPILATION)
    thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
  else
    thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
}

template <typename T>
__global__ void my_dev_sort(T *data, size_t num){
  my_sort(data, num);
}
typedef int mytype;
const size_t sz = 10;
int main(){
  mytype *d_data;
  cudaMallocManaged(&d_data, sz*sizeof(mytype));
  cudaMemset(d_data, 0, sz*sizeof(mytype));
  my_sort(d_data, sz);
  my_dev_sort<<<1,1>>>(d_data, sz);
  cudaDeviceSynchronize();
  cudaFree(d_data);
}

Upvotes: 3

Robert Crovella
Robert Crovella

Reputation: 151869

It seems you are stepping on this issue. In a nutshell, Thrust uses CUB functionality under the hood for certain algorithms (including sort). Your use of the __CUDA_ARCH__ macro in your code, which wraps around Thrust algorithm calls that use CUB, interferes with CUB code that expects to be able to use this macro for all paths.

A possible workaround is to do "your own dispatch":

$ cat t142.cu
#include <iostream>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>


template <typename T>
struct customGreater {
  __host__ __device__ bool operator()(T &t1, T &t2){
    return (t1 > t2);
  }
};

template <typename T>
__host__ __device__
void my_sort_wrapper(T *data, size_t num){
    int hostdev = 0;  // 0=device code
#ifndef __CUDA_ARCH__
    hostdev = 1;  // 1=host code
#endif
    if (hostdev == 0) thrust::stable_sort(thrust::seq, data, data + num, customGreater<T>());
    else thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());

}

template <typename T>
__global__ void my_dev_sort(T *data, size_t num){
  my_sort_wrapper(data, num);
}
typedef int mytype;
const size_t sz = 10;
int main(){
  mytype *d_data;
  cudaMalloc(&d_data, sz*sizeof(mytype));
  cudaMemset(d_data, 0, sz*sizeof(mytype));
  my_sort_wrapper(d_data, sz);
  my_dev_sort<<<1,1>>>(d_data, sz);
  cudaDeviceSynchronize();
}
$ nvcc t142.cu -o t142
$ cuda-memcheck ./t142
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

Arranged this way, the use of the __CUDA_ARCH__ macro does not perturb the compilation of the Thrust algorithms.

Another possible workaround is simply to use thrust::device policy for both cases (no dispatch - just the thrust algorithm call). Except in the case of CUDA Dynamic Parallelism, thrust::device will "decay" to thrust::seq when used in device code.
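For illustration, here is a minimal sketch of that no-dispatch variant (the wrapper name my_sort_nodispatch is hypothetical; the comparator and setup mirror the examples above):

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

template <typename T>
struct customGreater {
  __host__ __device__ bool operator()(T &t1, T &t2){
    return (t1 > t2);
  }
};

// Hypothetical wrapper: one policy, no __CUDA_ARCH__ dispatch. When this
// runs in device code (outside CUDA Dynamic Parallelism), thrust::device
// decays to sequential in-thread execution, as noted above.
template <typename T>
__host__ __device__
void my_sort_nodispatch(T *data, size_t num){
  thrust::stable_sort(thrust::device, data, data + num, customGreater<T>());
}

Because both compilation passes see the identical thrust::device call, the kernel template instantiations are generated consistently and no __CUDA_ARCH__ guard is needed.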

I would expect that these suggestions would only be necessary/relevant when the thrust algorithm uses CUB functionality in the underlying implementation.

If you don't like this behavior, you could file a thrust issue.

Upvotes: 3
