Is there implicit barrier synchronisation at the start or end of a CUDA __device__ function?

Question

If a CUDA kernel calls a __device__ function, is there any implicit synchronisation between all the threads in the block on either entering or exiting the __device__ function?

If not, does that mean some threads in the block could have exited the __device__ function before other threads in the block have even entered it (in the absence of any explicit synchronisation)?

Any pointers to relevant information/references would be appreciated.

user1545642 · Accepted Answer

Yes, that can happen: the only implicit synchronization is between threads within a warp, as talonmies pointed out. There is no block-wide barrier on entry to or exit from a __device__ function. When the kernel is launched, the hardware picks a warp (probably the first one), executes the first instruction for it, and then switches to another warp. One warp can therefore retire before another warp has even executed the first instruction of the kernel, since warps are not required to wait for one another on exit.
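If a __device__ function needs every thread in the block to have reached a certain point, the barrier has to be written explicitly with __syncthreads(). The following is a minimal sketch of that idea, not code from the question: the names read_neighbour, demo_kernel, run_demo and kBlockSize are hypothetical, and it assumes a single block of 128 threads so that each thread reads a value written by a thread in a different warp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBlockSize = 128;  // hypothetical choice: four warps per block

// Hypothetical helper: each thread publishes a value to shared memory,
// then reads the value published by a thread in a different warp.
__device__ int read_neighbour(int *tile, int value)
{
    int tid = threadIdx.x;
    tile[tid] = value;

    // Entering a __device__ function inserts no barrier, so it is explicit.
    // Every thread in the block must reach this call.
    __syncthreads();

    int result = tile[(tid + 32) % blockDim.x];

    // Leaving the function inserts no barrier either; this one keeps the
    // tile safe to overwrite if the caller reuses it right away.
    __syncthreads();
    return result;
}

__global__ void demo_kernel(int *out)
{
    __shared__ int tile[kBlockSize];
    out[threadIdx.x] = read_neighbour(tile, static_cast<int>(threadIdx.x) * 2);
}

// Host-side check: launches one block and verifies every thread saw the
// value written by the thread 32 positions ahead of it.
bool run_demo()
{
    int *d_out = nullptr;
    int h_out[kBlockSize];
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;

    demo_kernel<<<1, kBlockSize>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < kBlockSize; ++i) {
        if (h_out[i] != ((i + 32) % kBlockSize) * 2) return false;
    }
    return true;
}

int main()
{
    printf(run_demo() ? "ok\n" : "mismatch\n");
    return 0;
}
```

Removing the first __syncthreads() would turn the read of tile into a race, because the warp that owns the neighbouring slot may not have executed its write yet. Note that __syncthreads() must be reached by all threads of the block, so it should not sit inside a branch that only some threads take.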
