CUDA order of block execution in single kernel launch

Question

I'm launching 256 threads in total. When I launch them as a single block, everything works fine. But when I launch them as a 2x2 grid of blocks, each with 8x8 threads, the kernel loops forever. The real problem is that my kernel code waits for partial results from other blocks. After running several tests, I observed that the blocks are launched in a random order and seem to be executed sequentially.

Do CUDA blocks run in parallel if they're launched from the same kernel? The GPU should not be a limitation, since I'm launching only 256 threads and a GTX 580 can handle them (everything works fine in a single-block launch of 16x16 threads). Is there a way to know the order of execution, or to specify it?

stuhlo · Accepted Answer

Yes, blocks run in parallel. How many blocks run concurrently depends on the resources of your GPU, but the important thing is that the order in which blocks are launched is undefined and cannot be specified. A kernel therefore must not rely on one block waiting for results from another. Read more here - chapter 2.2, last three paragraphs.
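One common way to avoid depending on block order is to split the work into two kernel launches: the first writes each block's partial result to global memory, and the second combines them after the first has finished. The following is only a minimal sketch, assuming the partial results are simple sums over a 16x16 input. The kernel names `partialSums` and `combine` and the buffer names are hypothetical and are not taken from the question's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each 8x8 block sums its own tile and writes one
// partial result. No block ever waits for another block.
__global__ void partialSums(const int *in, int *partial, int width)
{
    __shared__ int tile[64];                      // assumes 8x8 = 64 threads per block
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int t = threadIdx.y * blockDim.x + threadIdx.x;

    tile[t] = in[y * width + x];
    __syncthreads();                              // synchronizes threads of THIS block only

    // Tree reduction inside the block (block size is a power of two).
    for (int s = (blockDim.x * blockDim.y) / 2; s > 0; s >>= 1) {
        if (t < s) tile[t] += tile[t + s];
        __syncthreads();
    }
    if (t == 0) partial[blockIdx.y * gridDim.x + blockIdx.x] = tile[0];
}

// Hypothetical kernel: runs only after partialSums has completed, because
// launches on the same stream execute in order.
__global__ void combine(const int *partial, int n, int *out)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += partial[i];
    *out = sum;
}

int main()
{
    const int width = 16, n = width * width, numBlocks = 4;
    int h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    int *d_in, *d_partial, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, numBlocks * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    dim3 grid(2, 2), block(8, 8);                 // same layout as in the question
    partialSums<<<grid, block>>>(d_in, d_partial, width);
    combine<<<1, 1>>>(d_partial, numBlocks, d_out);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_out);                  // prints "sum = 256"

    cudaFree(d_in); cudaFree(d_partial); cudaFree(d_out);
    return 0;
}
```

The boundary between the two launches acts as the grid-wide synchronization point, so the result is the same whatever order the four blocks run in. Error checking is omitted to keep the sketch short.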
