Reputation: 99
I'm a newbie at CUDA programming, but I need to use it in a complex project, so I really need some help.
My question is: if I want to execute a child kernel 256 times concurrently, how can I do that with Dynamic Parallelism?
I read an NVIDIA blog post, and it says:
By default, grids launched within a thread block are executed sequentially: the next grid starts executing only after the previous one has finished. This happens even if grids are launched by different threads within the block.
So my idea is to set the block size to (1,1) and the grid size to (256,1) for the parent kernel, so that I can launch the child kernel concurrently from 256 threads in different blocks. Would that be very inefficient? Is there a better solution?
Upvotes: 1
Views: 308
Reputation: 11926
That quote continues with
Often, however, more concurrency is desired; as with host-side kernel launches, we can use CUDA streams to achieve this. All streams created on the device are non-blocking; that is, they do not support implicit synchronization with the default NULL stream. Therefore, what follows is the only way to create a stream in device code.
cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
Then using a different (device-side) stream per CUDA thread should make the child kernels run independently, instead of serializing on the default stream.
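A minimal sketch of what that looks like, assuming a hypothetical `childKernel` and that you compile with relocatable device code enabled (`-rdc=true`) on a device with compute capability 3.5 or higher:

```cuda
__global__ void childKernel(int parentId)
{
    // ... per-child work ...
}

__global__ void parentKernel()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each parent thread creates its own non-blocking device stream,
    // so its child launch is not serialized behind launches from
    // other threads in the block.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    childKernel<<<1, 32, 0, s>>>(tid);  // launch into the per-thread stream

    // Device-created streams should be destroyed by the thread that
    // created them; destruction is deferred until the work completes.
    cudaStreamDestroy(s);
}
```

On the host you could then launch `parentKernel<<<1, 256>>>();` and let each of the 256 parent threads issue its own child launch, rather than using 256 one-thread blocks.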
Additionally, you can coalesce multiple launches into one big launch by performing a reduction across the parent threads in a block, increasing the total number of child threads and adjusting the mapping from thread ID to problem space accordingly. This should overcome both the performance cost of many small kernels and the hardware limit on concurrent kernel executions per device (4 to 128, depending on CUDA Compute Capability).
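A sketch of the coalescing idea, with assumed sizes (`NUM_PARENTS`, `CHILD_WORK`) and a placeholder computation; the point is that one thread issues a single large child grid, and each child thread recovers its (parent, element) pair from its global index:

```cuda
#define NUM_PARENTS 256
#define CHILD_WORK  1024  // assumed per-parent element count

__global__ void bigChildKernel(float *data)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= NUM_PARENTS * CHILD_WORK) return;

    int parent  = gid / CHILD_WORK;  // which logical parent this element belongs to
    int element = gid % CHILD_WORK;  // index within that parent's range
    data[gid] = (float)(parent + element);  // placeholder for real work
}

__global__ void parentKernel(float *data)
{
    // A single thread performs one coalesced launch covering all
    // 256 parents' work, instead of 256 separate small launches.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int total   = NUM_PARENTS * CHILD_WORK;
        int threads = 256;
        int blocks  = (total + threads - 1) / threads;
        bigChildKernel<<<blocks, threads>>>(data);
    }
}
```

This trades per-parent launch overhead for one index computation per child thread, which is usually a good deal when the individual child kernels are small.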
Upvotes: 2