Reputation: 3026
My understanding of the difference between CPUs and GPUs is that GPUs are not general-purpose processors: if a video card contains 10 GPUs, each GPU actually shares the same program pointer, so to optimize parallelism on the card I need to ensure every GPU is running the same code.
Synchronisation is not a problem on the same card since each GPU is physically running in parallel so they should all complete at the same time.
My question is, how does this work on multiple cards? At the speeds at which they operate, doesn't the hardware make a slight difference in execution times, such that a calculation on one GPU on one card may finish sooner or later than the same calculation on another GPU on another card?
thanks
Upvotes: 0
Views: 1223
Reputation: 2318
I think you may be confused about how threads work on a GPU. First, to address the issue of multiple GPUs: multiple GPUs NEVER share the program pointer, so they will almost never complete a kernel at the same time.
On a single GPU, only threads that are executing ON THE SAME COMPUTE UNIT (or SM in NVIDIA parlance) AND are part of the same warp/wavefront are guaranteed to execute in sync. You can never really count on this, but for some devices the compiler can determine that this will be the case (I am specifically thinking about some AMD devices, as long as the workgroup size is hardcoded to 64).
In any case, as @vocaro pointed out, that's why you need to use a barrier for local memory. To emphasize: even on the same GPU, threads are not executing in lockstep across the whole device, only within a warp/wavefront on a single compute unit.
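To make the barrier point concrete, here is a rough CPU-side analogy in Python rather than real OpenCL. Each thread plays one work-item, the shared `tile` list stands in for local memory, and `threading.Barrier` plays the role of `barrier(CLK_LOCAL_MEM_FENCE)`. The names `run_work_group`, `work_item`, and `GROUP_SIZE` are made up for this sketch, and the random sleep is only an assumed stand-in for uneven memory latency.

```python
import random
import threading
import time

GROUP_SIZE = 8  # hypothetical work-group size, chosen for illustration


def run_work_group(values):
    """Reverse `values` through a shared tile, one thread per simulated work-item."""
    n = len(values)
    tile = [None] * n            # stands in for __local memory
    out = [None] * n             # stands in for the global output buffer
    sync = threading.Barrier(n)  # stands in for barrier(CLK_LOCAL_MEM_FENCE)

    def work_item(lid):
        # Assumed stand-in for uneven memory latency: items arrive at different times.
        time.sleep(random.uniform(0.0, 0.01))
        tile[lid] = values[lid]
        # Wait until every work-item in the group has written its slot.
        sync.wait()
        # Only now is it safe to read a slot written by another work-item.
        out[lid] = tile[n - 1 - lid]

    threads = [threading.Thread(target=work_item, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


if __name__ == "__main__":
    print(run_work_group(list(range(GROUP_SIZE))))
```

Calling `run_work_group(list(range(8)))` returns the list reversed. If you drop the `sync.wait()` line, a fast work-item can read a slot that is still `None`, which is the same kind of race you get in a kernel that reads local memory without a barrier.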
Upvotes: 2
Reputation: 2779
"Synchronisation is not a problem on the same card since each GPU is physically running in parallel so they should all complete at the same time."
This is not true. Different threads on a GPU may complete at different times due to, for example, differences in memory access latency. That is why OpenCL provides synchronization primitives such as the barrier function. You can never assume that your threads are running precisely in parallel.
The same is true for multiple GPUs. There is no guarantee that they are in sync, so you will need to rely on API calls such as clFinish to explicitly synchronize their work.
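As a rough illustration of the multi-GPU case, here is a CPU-side analogy in Python, not real OpenCL host code. Two worker threads pretend to be two cards running the same work with different latencies, and blocking on each result plays the role of calling `clFinish` on each device's command queue. The names `run_kernel` and `run_on_two_devices` and the latency values are assumptions made up for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_kernel(data, latency_s):
    """Pretend to run a kernel on one device; latency_s models that device's speed."""
    time.sleep(latency_s)
    return sum(x * x for x in data)


def run_on_two_devices(data):
    """Split the work across two simulated devices and wait for both to finish."""
    half = len(data) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Same computation, different completion times, as on two physical cards.
        slow = pool.submit(run_kernel, data[:half], 0.02)
        fast = pool.submit(run_kernel, data[half:], 0.005)
        # Blocking here is the analogue of clFinish on each command queue:
        # the host does not combine results until both devices are done.
        return slow.result() + fast.result()


if __name__ == "__main__":
    print(run_on_two_devices([1, 2, 3, 4]))
```

The point is that the host never assumes the two devices finish together. It waits explicitly on each one before touching the results.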
Upvotes: 3