Reputation: 3520
I have b blocks and each block has t threads. I can use
__syncthreads()
to synchronize the threads within a particular block. For example:
__global__ void aFunction()
{
    for (int i = 0; i < 10; i++)
    {
        // execute something
        __syncthreads();
    }
}
But my problem is to synchronize all the threads in all the blocks. How can I do this?
Upvotes: 38
Views: 36128
Reputation: 76724
Cooperative groups have some requirements, such as needing to launch your kernel via cudaLaunchCooperativeKernel, which makes them a less attractive solution for simple projects.
An easy alternative is using atomics with bitfields, like so:
// A global var with 64 bits can track 64 blocks,
// use an array if you need to track more blocks
__device__ unsigned long long CompleteMask;

__global__ void doWork() {
    // clear the mask before the work starts
    atomicAnd(&CompleteMask, 0ULL);
    // do lots of work
    // ...
    // the mask value that means "every block has set its bit"
    const auto SollMask = (1ULL << gridDim.x) - 1;
    if (threadIdx.x == 0) {
        // set this block's bit and spin until all blocks have set theirs
        while (atomicOr(&CompleteMask, 1ULL << blockIdx.x) != SollMask) { /* do nothing */ }
    }
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Print a single line for the entire process");
    }
}
Because every block is assigned its own bit in the mask, they can never interfere. If you have more than 64 blocks, use an array of 32-bit masks to track the per-block bits, plus a second-level mask to track which words are complete, like so:
// Two 32-bit words can track up to 64 blocks; extend the array
// (and the super mask) if you need to track more blocks
__device__ unsigned int CompleteMask[2];
__device__ unsigned int CompleteSuperMask;

__global__ void doWork() {
    // clear the masks before the work starts
    for (auto i = 0; i < 2; i++) { atomicAnd(&CompleteMask[i], 0u); }
    atomicAnd(&CompleteSuperMask, 0u);
    // do lots of work
    // ...
    // expected "all bits set" values, assuming 33 to 64 blocks
    unsigned int SollMask[3];
    SollMask[0] = ~0u;                                                    // first word: fully used
    SollMask[1] = (gridDim.x % 32) ? (1u << (gridDim.x % 32)) - 1 : ~0u;  // last word: possibly partial
    SollMask[2] = (1u << ((gridDim.x + 31) / 32)) - 1;                    // super mask: one bit per word in use
    const auto b = blockIdx.x / 32;
    // set this block's bit and spin until its word is complete
    while (atomicOr(&CompleteMask[b], 1u << (blockIdx.x % 32)) != SollMask[b]) { /* do nothing */ }
    // mark this word complete and spin until every word is complete
    while (atomicOr(&CompleteSuperMask, 1u << b) != SollMask[2]) { /* do nothing */ }
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Print a single line for the entire process");
    }
}
It works well when the number of blocks is less than or equal to the number of SMs. However, if the block count exceeds the physical number of SMs, some blocks are never scheduled while the resident ones spin-wait for them, and the execution freezes. This is a limitation of the solution.
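One way to stay inside that limit is to ask the runtime how many blocks can actually be resident at once and cap the grid size accordingly. Here is a minimal host-side sketch; the doWork stub and the block size of 256 are placeholders, not part of the code above:
#include <cstdio>
#include <cuda_runtime.h>

// stand-in for the spin-wait barrier kernel shown above
__global__ void doWork() { /* ... work + barrier ... */ }

int main() {
    int device = 0;
    cudaGetDevice(&device);

    int smCount = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);

    // how many blocks of doWork can be resident per SM at this block size
    const int blockSize = 256;
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, doWork, blockSize, 0);

    // never launch more blocks than can run concurrently, otherwise the
    // spin-wait barrier deadlocks; the 64-bit mask also caps us at 64 blocks
    int gridSize = smCount * blocksPerSm;
    if (gridSize > 64) gridSize = 64;

    printf("Launching %d blocks of %d threads\n", gridSize, blockSize);
    doWork<<<gridSize, blockSize>>>();
    cudaDeviceSynchronize();
    return 0;
}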
Upvotes: 3
Reputation: 21818
With CUDA 9, NVIDIA introduced the concept of cooperative groups, which lets you synchronize all threads belonging to a group. Such a group can span all the threads in the grid, so you can synchronize all threads in all blocks:
#include <cuda_runtime_api.h>
#include <cuda.h>
#include <cooperative_groups.h>

// inside a kernel that was launched cooperatively:
__global__ void myKernel() {
    cooperative_groups::grid_group g = cooperative_groups::this_grid();
    g.sync();   // grid-wide barrier
}
You need a Pascal (compute capability 6.0) or newer architecture to synchronize grids. In addition, there are more specific requirements. See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#grid-synchronization-cg
Basic functionality, such as synchronizing groups smaller than a thread block down to warp granularity, is supported on all architectures, while Pascal and Volta GPUs enable new grid-wide and multi-GPU synchronizing groups.
Source: https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
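Putting those requirements together, a cooperative launch looks roughly like the sketch below; the kernel, its steps parameter and the block size of 128 are placeholders of mine, not from the question, and the code is typically compiled with something like nvcc -arch=sm_60 -rdc=true:
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

// hypothetical kernel with a grid-wide barrier between iterations
__global__ void myKernel(int steps) {
    cooperative_groups::grid_group g = cooperative_groups::this_grid();
    for (int i = 0; i < steps; i++) {
        // ... one step of work ...
        g.sync();   // all blocks wait here before the next step
    }
}

int main() {
    int device = 0, supported = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, device);
    if (!supported) { printf("Cooperative launch not supported\n"); return 1; }

    // the whole grid must be able to be resident on the device at once
    int smCount = 0, blocksPerSm = 0;
    const int blockSize = 128;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, myKernel, blockSize, 0);

    int steps = 10;
    void* args[] = { &steps };
    dim3 grid(smCount * blocksPerSm), block(blockSize);
    cudaLaunchCooperativeKernel((void*)myKernel, grid, block, args, 0, 0);
    cudaDeviceSynchronize();
    return 0;
}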
Before CUDA 9, there was no native way to synchronise all threads from all blocks. In fact, the concept of blocks in CUDA is that some may be launched only after other blocks have already finished their work, for example if the GPU they are running on is too weak to process them all in parallel.
If you ensure that you don't spawn too many blocks, you can try to synchronise all blocks between themselves, e.g. by actively waiting using atomic operations. This is however slow, puts pressure on the GPU memory controller, is considered "a hack" and should be avoided.
So, if you don't target the Pascal (or newer) architecture, the best way I can suggest is to simply terminate your kernel at the synchronisation point, and then launch a new kernel which continues with your job. In most circumstances it will actually perform faster than (or at least at a similar speed to) the mentioned hack.
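A minimal sketch of that kernel-splitting approach, with placeholder kernels phaseOne and phaseTwo (the names and the placeholder work are mine, not from the answer): kernels launched on the same stream run in order, so the boundary between the two launches acts as an implicit grid-wide barrier.
#include <cuda_runtime.h>

__device__ float intermediate[1024];   // state carried across the synchronisation point

// everything before the desired grid-wide barrier
__global__ void phaseOne() {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1024) intermediate[i] = i * 2.0f;   // placeholder work
}

// everything after the barrier
__global__ void phaseTwo() {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1024) intermediate[i] += 1.0f;      // placeholder work
}

int main() {
    const dim3 block(256), grid(4);
    // launches on the same stream execute in order: phaseTwo starts only
    // after every block of phaseOne has finished
    phaseOne<<<grid, block>>>();
    phaseTwo<<<grid, block>>>();
    cudaDeviceSynchronize();
    return 0;
}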
Upvotes: 57