coastal

Reputation: 151

Switch host functions depending on CUDA compute capability at runtime

I currently have a host function which includes a loop and a variety of CUBLAS calls. Now having access to CC 3.5 devices, I can write a single much more efficient kernel using dynamic parallelism. However, I want to continue to support the old function for CC < 3.5 devices. I now support multiple devices in the same binary with a couple of gencodes:

-gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35

I'd like to continue to produce a single binary supporting both architectures, but I can think of no way to switch this in host code. NVCC certainly can't generate a compiled code image for anything on the host AFAIK.

The following is no good (and also terribly ugly), since users building for CC < 3.5 won't be able to compile the kernel that uses 3.5 features:

int current_device;
cudaDeviceProp current_device_properties;
cudaGetDevice(&current_device);
cudaGetDeviceProperties(&current_device_properties, current_device);
if (current_device_properties.major < 3 && ... etc) {
  ...
}
else ...

__CUDACC__ or __CUDA_ARCH__ aren't useful here either.

My guess is that this is not possible and I will have to simply start compiling separate binaries and switch architectures in the preprocessor. But, if anyone can think of anything, great.

Upvotes: 0

Views: 394

Answers (1)

Tom Scogland

Reputation: 938

It depends on what your goal is. You seem to be asking about two different cases here.

First, if you believe a user might compile the code with an nvcc that does not support CC 3.5, then you will need preprocessor checks on __CUDA_ARCH__ to test the compute capability and prevent nvcc from attempting to compile the unsupported code.

Second, if you intend to compile the code to include implementations for both CC 3.5 and lower capabilities together, you should use the cudaGetDeviceProperties check as you have already noted to select the correct host implementation.

If you want both of these simultaneously, you will likely need an implementation that looks much like this:

int current_device;
cudaDeviceProp cdp;
cudaGetDevice(&current_device);
cudaGetDeviceProperties(&cdp, current_device);
if (cdp.major < 3 || (cdp.major == 3 && cdp.minor < 5)) {
  // loop and CUBLAS fallback for devices below CC 3.5
} else {
  kernel35<<<1, 1>>>();  // placeholder launch configuration
}

Likewise, your kernels would have to be guarded by a __CUDA_ARCH__ >= 350 check:

#if (__CUDA_ARCH__ >= 350)
__global__ void kernel35()
{
  ...
}
#else
__global__ void kernel35()
{
  // fake stub kernel to allow a non-3.5-capable nvcc to build the code
}
#endif

Also, I imagine you have tested that the new kernel is more efficient, but if the number of iterations is known ahead of time, dynamic parallelism is almost always slower than launching correctly from the CPU (in my tests, by as much as 40%), so I would recommend testing the performance thoroughly before making this switch for Kepler GPUs.

Edit: it occurs to me that the more compatible, and safer, option would be to phrase the second portion like this:

__global__ void kernel35()
{
#if (__CUDA_ARCH__ >= 350)
  ...
#else
  // stub
#endif
}

Upvotes: 1
