fyl

Reputation: 69

Overloading CUDA kernel functions

I encountered a problem when using overloaded kernel functions in CUDA.

I understand that CUDA can launch an overloaded kernel function and resolve the overload from its arguments.

However, I would like to use cudaOccupancyMaxPotentialBlockSize() to calculate the block size for maximum occupancy (see the documentation):

__global__ void foo_cuda_kernel(int a)
{
  /*implementation 1*/
}

//overloaded kernel function
__global__ void foo_cuda_kernel(int a, int b)
{
  /*implementation 2*/
}

void foo_cuda(int thread_num)
{
  int min_grid_size, grid_size, block_size;
  cudaOccupancyMaxPotentialBlockSize
  (
    &min_grid_size, &block_size, 
    foo_cuda_kernel, //how does it distinguish overloaded functions?
    0, thread_num
  );
  grid_size = (thread_num + block_size - 1) / block_size;
  
  //the compiler can distinguish the launched function by its arguments
  foo_cuda_kernel<<<grid_size, block_size>>>((int)1);
  cudaDeviceSynchronize();
}

How do I make this work? How does cudaOccupancyMaxPotentialBlockSize() distinguish between overloaded functions?

Upvotes: 0

Views: 572

Answers (1)

talonmies

Reputation: 72342

As noted in the comments, you can cast the function name to a pointer to the correct overload:

auto foo_ii = static_cast<void (*)(int, int)>(&foo_cuda_kernel);
auto foo_i = static_cast<void (*)(int)>(&foo_cuda_kernel);

You then pass either foo_i or foo_ii to cudaOccupancyMaxPotentialBlockSize, depending on which version of the function you require.

This will work because the toolchain silently emits host boilerplate functions which wrap the underlying runtime API calls to run a kernel and enforce kernel argument type checking. The host compiler treats these wrappers like any other host function (because they are), and selects the matching version automagically.
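For completeness, here is a minimal sketch of the question's foo_cuda() put together with the cast. thread_num is an assumed parameter giving the total number of threads to launch, and the kernel bodies are left empty as in the question:

#include <cuda_runtime.h>

__global__ void foo_cuda_kernel(int a)
{
  /*implementation 1*/
}

__global__ void foo_cuda_kernel(int a, int b)
{
  /*implementation 2*/
}

void foo_cuda(int thread_num)
{
  int min_grid_size, grid_size, block_size;

  //select the void(int) overload explicitly before handing it to the
  //occupancy calculator; the cast resolves the ambiguity
  auto foo_i = static_cast<void (*)(int)>(&foo_cuda_kernel);

  cudaOccupancyMaxPotentialBlockSize
  (
    &min_grid_size, &block_size,
    foo_i,
    0, thread_num
  );
  grid_size = (thread_num + block_size - 1) / block_size;

  //the launch itself is disambiguated by the argument list, as usual
  foo_cuda_kernel<<<grid_size, block_size>>>(1);
  cudaDeviceSynchronize();
}

To query the occupancy of the two-argument version instead, pass foo_ii in place of foo_i.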

Upvotes: 1
