Reputation: 69
I encountered a problem when using overloaded kernel functions in CUDA.
I understand that CUDA can resolve an overloaded kernel launch from its arguments.
However, I would like to use cudaOccupancyMaxPotentialBlockSize()
to calculate the block size for maximum occupancy (see the documentation).
__global__ void foo_cuda_kernel(int a)
{
    /* implementation 1 */
}

// overloaded kernel function
__global__ void foo_cuda_kernel(int a, int b)
{
    /* implementation 2 */
}
void foo_cuda(int thread_num)
{
    int min_grid_size, grid_size, block_size;

    cudaOccupancyMaxPotentialBlockSize
    (
        &min_grid_size, &block_size,
        foo_cuda_kernel, // how does it distinguish overloaded functions?
        0, thread_num
    );

    grid_size = (thread_num + block_size - 1) / block_size;

    // I can understand the compiler can distinguish the launched function by its arguments
    foo_cuda_kernel<<<grid_size, block_size>>>((int)1);
    cudaDeviceSynchronize();
}
How do I make this work? How does cudaOccupancyMaxPotentialBlockSize()
distinguish between overloaded functions?
Upvotes: 0
Views: 572
Reputation: 72342
As noted in the comments, you can cast the function name to a pointer to the correct overload:
auto foo_ii = static_cast<void (*)(int, int)>(&foo_cuda_kernel);
auto foo_i = static_cast<void (*)(int)>(&foo_cuda_kernel);
You then pass either foo_i or foo_ii to cudaOccupancyMaxPotentialBlockSize, depending on which version of the function you require.
This will work because the toolchain silently emits host boilerplate functions which wrap the underlying runtime API calls that launch a kernel and which enforce kernel argument type checking. The host compiler treats these wrappers like any other host function (because they are), and selects the matching overload automagically.
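Putting this together, here is a minimal sketch of the fix (assuming the placeholder kernels from the question; thread_num is a hypothetical caller-supplied thread count, and error checking is omitted):

#include <cuda_runtime.h>

__global__ void foo_cuda_kernel(int a)        { /* implementation 1 */ }
__global__ void foo_cuda_kernel(int a, int b) { /* implementation 2 */ }

void foo_cuda(int thread_num)
{
    // The cast selects the one-argument overload unambiguously.
    auto foo_i = static_cast<void (*)(int)>(&foo_cuda_kernel);

    int min_grid_size = 0, block_size = 0;
    cudaOccupancyMaxPotentialBlockSize(
        &min_grid_size, &block_size,
        foo_i,        // unambiguous function pointer
        0,            // dynamic shared memory per block
        thread_num);  // block size limit

    int grid_size = (thread_num + block_size - 1) / block_size;

    // The launch itself needs no cast: the host-side wrapper is an
    // ordinary overloaded host function, so the argument list resolves it.
    foo_cuda_kernel<<<grid_size, block_size>>>(1);
    cudaDeviceSynchronize();
}

The same static_cast pattern selects the two-argument overload via foo_ii when that is the kernel whose occupancy you want to query.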
Upvotes: 1