Reputation: 7809
I just read the documentation on dynamic parallelism. What I'm wondering is: can I use cudaDeviceSynchronize() inside a kernel to synchronize all blocks currently running on that device?
The documentation says:
CUDA runtime operations from any thread, including kernel launches, are visible across a thread block. This means that an invoking thread in the parent grid may perform synchronization on the grids launched by that thread, by other threads in the thread block, or on streams created within the same thread block.
Furthermore:
Streams and events created within a grid exist within thread block scope but have undefined behavior when used outside of the thread block where they were created.
That's basically a NO to my question. BUT since cudaDeviceSynchronize() operates on a stream that is global to the whole device, I'm not sure whether that stream might be visible to, and the same for, ALL threads on the device, no matter which block or launch they belong to. If it were, I could actually use cudaDeviceSynchronize() inside a kernel for global synchronisation.
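To make this concrete, here is a minimal sketch of the pattern I'm asking about (purely hypothetical; the per-block work is made up, and it assumes compute capability 3.5 with -rdc=true -lcudadevrt):

    // Hypothetical pattern: each block publishes a partial result, then
    // calls cudaDeviceSynchronize() in the hope that it acts as a
    // device-wide barrier before block 0 combines the results.
    __global__ void sumPerBlock(int *partial, int *out, int nBlocks)
    {
        if (threadIdx.x == 0)
            partial[blockIdx.x] = blockIdx.x + 1;  // stand-in for real work

        // The question: does this wait for ALL blocks currently running
        // on the device, or only for work launched from this block?
        cudaDeviceSynchronize();

        if (blockIdx.x == 0 && threadIdx.x == 0) {
            int s = 0;
            for (int i = 0; i < nBlocks; ++i)
                s += partial[i];
            *out = s;  // only correct if the call above were device-wide
        }
    }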
Upvotes: 0
Views: 3561
Reputation: 1781
No. There is no way to safely do a device-wide synchronisation.
Section C.3.1.4 of the programming guide (Link):
The cudaDeviceSynchronize() function will synchronize on all work launched by any thread in the thread-block up to the point where cudaDeviceSynchronize() was called.
It says nothing about interacting with other thread blocks.
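For illustration, this is the use device-side cudaDeviceSynchronize() is actually specified for: waiting on child grids launched from the calling thread's own block, and nothing more. A minimal sketch, assuming compute capability 3.5 and compilation with -rdc=true -lcudadevrt (kernel names and sizes are made up):

    __global__ void childKernel(int *data)
    {
        data[threadIdx.x] *= 2;  // some child work
    }

    __global__ void parentKernel(int *data)
    {
        if (threadIdx.x == 0) {
            // Launch a child grid on this block's slice of the data.
            childKernel<<<1, 32>>>(data + blockIdx.x * 32);
            // Waits for child grids launched from THIS thread block.
            // It is not a barrier across the blocks of parentKernel.
            cudaDeviceSynchronize();
        }
        __syncthreads();  // make the child's results visible block-wide
        // all threads of this block may now read their 32 elements
    }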
Global synchronisation in CUDA would, in general, cause problems because of the over-subscription model most commonly used to fill GPUs with work: the number of blocks to synchronise would typically be far larger than the number that can be resident on the device at once, so the context of every block would have to be swapped in and out of global memory, destroying performance.
There are hacks you can use to get around this if you know you have a special case, but typically, the easiest and most efficient way to synchronize blocks is to exit the kernel and launch a new one.
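As a sketch of that last point (host code; kernel names and the trivial per-element work are made up): kernels issued to the same stream execute in order, so the boundary between two kernels is a device-wide barrier across all blocks:

    __global__ void phase1(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }
    __global__ void phase2(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }

    void run(float *d_data, int numBlocks, int threadsPerBlock)
    {
        // Every block of phase1 finishes before any block of phase2
        // starts: the kernel boundary is the global synchronisation.
        phase1<<<numBlocks, threadsPerBlock>>>(d_data);
        phase2<<<numBlocks, threadsPerBlock>>>(d_data);
        cudaDeviceSynchronize();  // optionally block the host as well
    }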
Upvotes: 6