Reputation: 113
I can't figure out the following. If I launch a kernel and consider, for example, thread 0 in block 0, will all the other threads in all the other blocks see the changes made to global memory by thread 0 in block 0 after a __syncthreads() call?
My guess is no. Indeed, in the synchronization functions section of the CUDA C Programming Guide, it is stated that:
void __syncthreads(); waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.
However, when talking about global memory consistency in dynamic parallelism, the CUDA C Programming Guide states that:
Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call.
So does __syncthreads() also make the changes visible across blocks when dynamic parallelism is involved?
Thanks
Upvotes: 2
Views: 2223
Reputation: 21515
The only action performed by __syncthreads() is the one you quoted yourself from the CUDA C Programming Guide. There is no way in CUDA to synchronize across blocks, apart from the naive approach of dividing the execution of a kernel into multiple kernel launches, with all the drawbacks that entails in terms of performance. Accordingly, the answer to your first question, as you also guessed yourself, is NO.
In the second part of your post, you are referring to a specific example from the CUDA C Programming Guide, namely:
__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) {
        child_launch<<< 1, 256 >>>(data);
        cudaDeviceSynchronize();
    }
    __syncthreads();
}

void host_launch(int *data) {
    parent_launch<<< 1, 256 >>>(data);
}
Here, all the 256 threads of the parent_launch kernel write something to data. After that, thread 0 invokes child_launch. The first __syncthreads() is needed to ensure that all the memory writes have completed before the child kernel invocation. Quoting the Guide on this point:
Due to the first __syncthreads() call, the child will see data[0]=0, data[1]=1, ..., data[255]=255 (without the __syncthreads() call, only data[0] would be guaranteed to be seen by the child).
Regarding the second __syncthreads(), the Guide explains that
When the child grid returns, thread 0 is guaranteed to see modifications made by the threads in its child grid. Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call.
In that specific example, the second __syncthreads() is redundant, since there is an implicit synchronization at kernel termination; it becomes necessary when other operations must be performed after the child kernel launch.
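To make that concrete, here is my own illustrative variant of the Guide's example (the out parameter and the final write are assumptions I added, not part of the Guide's code), showing a case where the second __syncthreads() is actually required:

```cuda
__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data, int *out) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();              // the child must see all 256 writes
    if (threadIdx.x == 0) {
        child_launch<<< 1, 256 >>>(data);
        cudaDeviceSynchronize();  // only thread 0 waits for the child
    }
    __syncthreads();              // child's writes become visible to ALL threads
    // Without the second __syncthreads(), only thread 0 would be
    // guaranteed to read the values updated by the child here.
    out[threadIdx.x] = data[threadIdx.x] * 2;
}
```

Since cudaDeviceSynchronize() is called by thread 0 alone, the second __syncthreads() is what propagates the child's modifications to the remaining threads before they read data again.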
Finally, concerning the sentence you are quoting in your post:
Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call
please note that, in that specific example, only one thread block is launched by the host_launch function. This perhaps may have somewhat misled you.
There is an interesting discussion (probably even more than one) on the NVIDIA Forum about thread synchronization across blocks, entitled
Synchronize all blocks in CUDA
Upvotes: 4