Reputation: 99
I'm a newbie at CUDA programming, but I need to use it in a complex project, so I really need some help.
My question is: if I want to execute a child kernel 256 times concurrently, how can I do that with Dynamic Parallelism?
I read an NVIDIA blog post, and it says:
By default, grids launched within a thread block are executed sequentially: the next grid starts executing only after the previous one has finished. This happens even if grids are launched by different threads within the block.
So my idea is to set the block size to (1,1) and the grid size to (256,1) for the parent kernel, so that I can launch the child kernel concurrently from 256 threads in different blocks. Would that be very inefficient? Is there a better solution?
Upvotes: 1
Views: 308
Reputation: 11926
That quote continues with
Often, however, more concurrency is desired; as with host-side kernel launches, we can use CUDA streams to achieve this. All streams created on the device are non-blocking; that is, they do not support implicit synchronization with the default NULL stream. Therefore, what follows is the only way to create a stream in device code.
cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
Then using a different (device-side) stream per CUDA thread should make the child kernels run independently, instead of serializing on the default stream.
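A minimal sketch of what that looks like, assuming a hypothetical `childKernel` and that you compile with relocatable device code enabled (`-rdc=true`) on a device with compute capability 3.5 or higher:

```cuda
__global__ void childKernel(int parentId)
{
    // ... per-child work ...
}

__global__ void parentKernel()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each parent thread creates its own non-blocking device stream,
    // so its child launch is not serialized behind launches from
    // other threads in the block.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    childKernel<<<1, 32, 0, s>>>(tid);  // launch into the per-thread stream

    // Device-created streams should be destroyed by the thread that
    // created them; destruction is deferred until the work completes.
    cudaStreamDestroy(s);
}
```

On the host you could then launch `parentKernel<<<1, 256>>>();` and let each of the 256 parent threads issue its own child launch, rather than using 256 one-thread blocks.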
Additionally, you can coalesce multiple launches into one big launch by performing a reduction across the parent threads in a block, increasing the total number of child threads and adjusting the mapping from thread ID to problem space accordingly. This should overcome both the performance cost of many small kernels and the hardware limit on concurrent kernel executions per device (4 to 128, depending on CUDA Compute Capability).
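A sketch of the coalescing idea, with assumed sizes (`NUM_PARENTS`, `CHILD_WORK`) and a placeholder computation; the point is that one thread issues a single large child grid, and each child thread recovers its (parent, element) pair from its global index:

```cuda
#define NUM_PARENTS 256
#define CHILD_WORK  1024  // assumed per-parent element count

__global__ void bigChildKernel(float *data)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= NUM_PARENTS * CHILD_WORK) return;

    int parent  = gid / CHILD_WORK;  // which logical parent this element belongs to
    int element = gid % CHILD_WORK;  // index within that parent's range
    data[gid] = (float)(parent + element);  // placeholder for real work
}

__global__ void parentKernel(float *data)
{
    // A single thread performs one coalesced launch covering all
    // 256 parents' work, instead of 256 separate small launches.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int total   = NUM_PARENTS * CHILD_WORK;
        int threads = 256;
        int blocks  = (total + threads - 1) / threads;
        bigChildKernel<<<blocks, threads>>>(data);
    }
}
```

This trades per-parent launch overhead for one index computation per child thread, which is usually a good deal when the individual child kernels are small.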
Upvotes: 2