Reputation: 1713
What is the closest and most efficient equivalent in CUDA (Pascal onwards) to a simple sequential chain of OpenGL glDispatchComputeIndirect() calls, where the chain of dispatches/kernels dynamically varies the launch parameters (blocks/threads etc.)?
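For reference, the indirect dispatch reads its work group counts from a GPU-side buffer object, roughly like this minimal sketch (computeProgram is assumed to be an already linked compute program; the struct layout is per the GL spec):

// Work group counts consumed by glDispatchComputeIndirect
typedef struct { GLuint num_groups_x, num_groups_y, num_groups_z; } DispatchIndirectCommand;

GLuint buf;
glGenBuffers(1, &buf);
glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, buf);
DispatchIndirectCommand cmd = { 64, 1, 1 }; // in practice written/updated on the GPU by an earlier pass
glBufferData(GL_DISPATCH_INDIRECT_BUFFER, sizeof(cmd), &cmd, GL_DYNAMIC_DRAW);

glUseProgram(computeProgram);
glDispatchComputeIndirect(0); // group counts read from the bound buffer at offset 0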
Is glDispatchComputeIndirect() likely just a driver-side API call overhead optimisation and not really a device-side mechanism?
Is glDispatchComputeIndirect() equivalent to nested parallelism with a top-level kernel of 1 block and 1 thread doing a sequence of child kernel calls?
__device__ int blocks = 1;
__device__ int threads = 1;

__global__ void Child1(); // writes new values to blocks/threads
__global__ void Child2(); // writes new values to blocks/threads

__global__ void Parent()
{
    Child1<<<blocks, threads>>>(); // child changes blocks/threads for the next launch
    cudaDeviceSynchronize();       // ensure the updated sizes are visible
    Child2<<<blocks, threads>>>(); // child changes blocks/threads again
    cudaDeviceSynchronize();
    // ...
}
Upvotes: 0
Views: 487
Reputation: 72349
Is glDispatchComputeIndirect() likely just a driver-side API call overhead optimisation and not really a device-side mechanism?
Almost certainly.
Is glDispatchComputeIndirect() equivalent to nested parallelism with a top-level kernel of 1 block and 1 thread doing a sequence of child kernel calls?
Almost certainly not.
Of course you would be free to use dynamic parallelism as a proxy for this functionality. It would also be possible to use mapped memory or managed memory to have the GPU write back to host-accessible memory between kernel launches. But the same "pipeline"-style indirect argument mechanism doesn't exist, as far as I am aware.
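For illustration, here is a minimal sketch of the managed-memory approach (the names LaunchParams, Pass1 and Pass2 are mine, purely for this sketch): each kernel writes the next stage's launch dimensions into managed memory, and the host synchronizes and then relaunches with those values.

#include <cuda_runtime.h>

struct LaunchParams { int blocks; int threads; }; // hypothetical struct for this sketch
__managed__ LaunchParams next = { 1, 1 };         // visible to both host and device

__global__ void Pass1()
{
    // ... do work, then decide how big the next pass should be
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        next.blocks  = 32;  // e.g. derived from data produced in this pass
        next.threads = 256;
    }
}

__global__ void Pass2() { /* ... */ }

int main()
{
    Pass1<<<next.blocks, next.threads>>>();
    cudaDeviceSynchronize();                 // makes Pass1's writes to 'next' visible to the host
    Pass2<<<next.blocks, next.threads>>>();  // host relaunches with the updated sizes
    cudaDeviceSynchronize();
    return 0;
}

You pay a host round trip per stage, but each kernel gets to decide the shape of the next launch, which is about as close to the indirect-argument pattern as you get without dynamic parallelism.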
Upvotes: 1