Reputation: 11
As the title says, I wonder whether it is possible to call something like __syncthreads() where the barrier applies not at block level but at sub-block level, so that I can synchronize all threads having a particular threadIdx.x.
For instance, if I launch a kernel with <<<1, dim3(32, 32)>>>, is it possible to write something like __syncthreads(5) so that it syncs all threads having threadIdx.x == 5?
According to the documentation, CUDA does not seem to define such a function; however, I wonder whether there is some trick that can achieve the same result.
Upvotes: 1
Views: 284
Reputation: 151879
Generally, no, this is not possible in CUDA. There is no provided method that synchronizes an arbitrary subset of the threads in a block.
CUDA does provide __syncwarp(), which allows synchronization of a warp (32 threads).
The CUDA cooperative groups mechanism does allow for synchronization "only" of subgroups of threads. But you do not have an arbitrary method to assign threads to groups.
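As a minimal sketch of what cooperative groups do offer, the block can be split into warp-sized tiles, and each tile can synchronize on its own. The kernel name tile_reverse, the helper run_tile_demo, and the 1024-thread 1-D launch are assumptions made for illustration. Note that the tiles are formed from consecutive thread ranks, not from an arbitrary thread set.

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical kernel, launched with 1024 threads in one block.
// Each 32-thread tile reverses its own slice of shared memory.
__global__ void tile_reverse(int *out)
{
    __shared__ int buf[1024];
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    const int rank = static_cast<int>(block.thread_rank());  // 0..1023
    const int lane = static_cast<int>(tile.thread_rank());   // 0..31
    const int base = rank - lane;                             // first slot of this tile

    buf[rank] = rank;
    tile.sync();                          // barrier for these 32 threads only
    out[rank] = buf[base + 31 - lane];    // read a value written by another lane
}

// Host-side check (illustrative helper, not part of any CUDA API).
bool run_tile_demo()
{
    const int n = 1024;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;
    tile_reverse<<<1, n>>>(d_out);
    std::vector<int> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int r = 0; r < n; ++r)
        if (h[r] != (r / 32) * 32 + 31 - (r % 32)) return false;
    return true;
}
```

Calling run_tile_demo() from main and checking the return value is enough to exercise it.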
At the PTX level, there is more flexibility in the use of barriers, but you still cannot assign an arbitrary set of threads to a barrier. (Instead, for example, arriving threads may simply be "counted".)
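To illustrate the "counted" behavior, here is a rough sketch that uses the PTX bar.sync instruction through inline assembly. The wrapper named_barrier_sync, the kernel partial_block_sync, the helper run_named_barrier_demo, and the 256-thread launch are all assumptions for this example. The barrier only knows how many threads must arrive (a multiple of the warp size), so it is the branch, not the barrier, that decides which warps take part.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Illustrative wrapper around PTX "bar.sync id, count".
// Barrier 0 is the one __syncthreads() uses, so callers should pick 1..15.
__device__ __forceinline__ void named_barrier_sync(unsigned id, unsigned num_threads)
{
    asm volatile("bar.sync %0, %1;" :: "r"(id), "r"(num_threads) : "memory");
}

// Hypothetical kernel, launched with 256 threads in one 1-D block.
__global__ void partial_block_sync(int *out)
{
    __shared__ int buf[128];
    const int tid = threadIdx.x;
    if (tid < 128) {                   // warps 0..3; whole warps take this branch
        buf[tid] = tid;
        named_barrier_sync(1, 128);    // wait until 128 threads reach barrier 1
        out[tid] = buf[127 - tid];     // read a slot written by another warp
    } else {
        out[tid] = -1;                 // warps 4..7 never touch barrier 1
    }
}

// Host-side check (illustrative helper).
bool run_named_barrier_demo()
{
    const int n = 256;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;
    partial_block_sync<<<1, n>>>(d_out);
    std::vector<int> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int t = 0; t < n; ++t)
        if (h[t] != (t < 128 ? 127 - t : -1)) return false;
    return true;
}
```

This still works at warp granularity, so it would not directly cover the case in the question, where the threads with threadIdx.x == 5 are spread one per warp.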
My suggestion would be to use one of the above methods. For example, if you wanted to synchronize all threads with threadIdx.x == 5 in a (32,32) threadblock, that is 32 threads, the same size as a warp. In that layout those threads sit in 32 different warps (one per value of threadIdx.y), so rearrange your thread-to-work mapping so that those 32 threads belong to the same warp, and use __syncwarp().
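A minimal sketch of that remapping, assuming the group index can be moved from threadIdx.x to threadIdx.y: in a (32,32) block the linear thread id is threadIdx.x + 32 * threadIdx.y, so all threads sharing a threadIdx.y value form exactly one warp. The kernel name rotate_within_group and the helper run_rotate_demo are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel, launched with dim3(32, 32).
// The index we want to synchronize on is threadIdx.y, so each group is one warp.
__global__ void rotate_within_group(int *out)
{
    __shared__ int buf[32][32];
    const int lane  = threadIdx.x;   // position inside the group
    const int group = threadIdx.y;   // group index == warp index in this layout

    buf[group][lane] = group * 100 + lane;
    __syncwarp();                    // barrier and memory ordering for this warp only
    out[group * 32 + lane] = buf[group][(lane + 1) % 32];  // neighbor's value
}

// Host-side check (illustrative helper).
bool run_rotate_demo()
{
    const int n = 32 * 32;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;
    rotate_within_group<<<1, dim3(32, 32)>>>(d_out);
    std::vector<int> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int g = 0; g < 32; ++g)
        for (int l = 0; l < 32; ++l)
            if (h[g * 32 + l] != g * 100 + (l + 1) % 32) return false;
    return true;
}
```

With this mapping, "all threads of group 5" means the threads with threadIdx.y == 5, and the __syncwarp() they execute involves no other warp.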
Upvotes: 3