Reputation: 1713
What is the closest and most efficient equivalent in CUDA (Pascal onwards) to a simple sequential chain of OpenGL glDispatchComputeIndirect() calls, where the chain of dispatches/kernels dynamically varies the launch parameters (blocks/threads etc.)?
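For reference, the indirect dispatch reads its work group counts from a GPU-side buffer object, roughly like this minimal sketch (computeProgram is assumed to be an already linked compute program; the struct layout is per the GL spec):

// Work group counts consumed by glDispatchComputeIndirect
typedef struct { GLuint num_groups_x, num_groups_y, num_groups_z; } DispatchIndirectCommand;

GLuint buf;
glGenBuffers(1, &buf);
glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, buf);
DispatchIndirectCommand cmd = { 64, 1, 1 }; // in practice written/updated on the GPU by an earlier pass
glBufferData(GL_DISPATCH_INDIRECT_BUFFER, sizeof(cmd), &cmd, GL_DYNAMIC_DRAW);

glUseProgram(computeProgram);
glDispatchComputeIndirect(0); // group counts read from the bound buffer at offset 0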
Is glDispatchComputeIndirect() likely just a driver-side API call overhead optimisation and not really a device-side mechanism?
Is glDispatchComputeIndirect() equivalent to nested parallelism with a top-level kernel of 1 block and 1 thread doing a sequence of child kernel calls?
__device__ int blocks = 1;
__device__ int threads = 1;

__global__ void Child1(); // writes new values to blocks/threads
__global__ void Child2(); // writes new values to blocks/threads

__global__ void Parent()
{
    Child1<<<blocks, threads>>>(); // child changes blocks/threads for the next launch
    cudaDeviceSynchronize();       // ensure the updated sizes are visible
    Child2<<<blocks, threads>>>(); // child changes blocks/threads again
    cudaDeviceSynchronize();
    // ...
}
Upvotes: 0
Views: 487
Reputation: 72349
Is glDispatchComputeIndirect() likely just a driver-side API call overhead optimisation and not really a device-side mechanism?
Almost certainly.
Is glDispatchComputeIndirect() equivalent to nested parallelism with a top-level kernel of 1 block and 1 thread doing a sequence of child kernel calls?
Almost certainly not.
Of course you would be free to use dynamic parallelism as a proxy for this functionality. It would also be possible to use mapped memory or managed memory to have the GPU write back to host-accessible memory between kernel launches. But the same "pipeline"-style indirect argument mechanism doesn't exist, as far as I am aware.
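For illustration, here is a minimal sketch of the managed-memory approach (the names LaunchParams, Pass1 and Pass2 are mine, purely for this sketch): each kernel writes the next stage's launch dimensions into managed memory, and the host synchronizes and then relaunches with those values.

#include <cuda_runtime.h>

struct LaunchParams { int blocks; int threads; }; // hypothetical struct for this sketch
__managed__ LaunchParams next = { 1, 1 };         // visible to both host and device

__global__ void Pass1()
{
    // ... do work, then decide how big the next pass should be
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        next.blocks  = 32;  // e.g. derived from data produced in this pass
        next.threads = 256;
    }
}

__global__ void Pass2() { /* ... */ }

int main()
{
    Pass1<<<next.blocks, next.threads>>>();
    cudaDeviceSynchronize();                 // makes Pass1's writes to 'next' visible to the host
    Pass2<<<next.blocks, next.threads>>>();  // host relaunches with the updated sizes
    cudaDeviceSynchronize();
    return 0;
}

You pay a host round trip per stage, but each kernel gets to decide the shape of the next launch, which is about as close to the indirect-argument pattern as you get without dynamic parallelism.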
Upvotes: 1