rwallace

Reputation: 33365

How do GPU cores communicate with each other?

GPUs, when used for general-purpose computing, put a lot of emphasis on fine-grained parallelism with SIMD and SIMT. They perform best on regular number-crunching workloads with high arithmetic intensity.

Nonetheless, to be applicable to as many workloads as they have been applied to, they must also be capable of coarse-grained MIMD parallelism, where different cores execute different instruction streams on different chunks of data.

This means different cores on the GPU must synchronize with each other after executing different instruction streams. How do they do it?

On a CPU the answer would be that there is cache coherence, plus a set of communication primitives such as CAS or LL/SC that are designed to work well with it. But as I understand it, GPUs do not have cache coherence - avoiding that overhead is the biggest reason they are more efficient than CPUs in the first place.

So what method do GPU cores use for synchronizing with each other? If the answer to how they exchange data is by writing to shared main memory, then how do they synchronize so the sender can inform the recipient when to read the data?

If the answer depends on the particular architecture, then I'm particularly interested in modern Nvidia GPUs that support CUDA.

Edit: From the document Booo linked, here is my understanding so far:

They seem to use the word 'stream' for a sequence of work that gets done in order (including fine-grained parallelism like SIMD); the problem is then how to synchronize/communicate between multiple streams.

As I surmised, this is much more explicit than it is on CPUs. In particular, streams can communicate by writing data to main memory (or L3 cache?), and there is nothing like the cache coherence there is on CPUs; instead there is locking of pages of memory, and/or an explicit synchronization API.
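To check my understanding, here is a minimal sketch (my own construction, not something from the document) of two kernels exchanging data through global memory. The kernel and variable names are made up; the ordering guarantee comes from launching both kernels on the same stream, with an explicit synchronization call for the host:

    #include <cuda_runtime.h>

    // Hypothetical producer/consumer kernels; the names and the handshake
    // are illustrative, not taken from any NVIDIA document.
    __global__ void producer(int *buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = i * 2;      // write results to global memory
    }

    __global__ void consumer(const int *buf, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = buf[i] + 1; // safe: launch order on one stream
                                        // orders these reads after the writes
    }

    int main() {
        const int n = 1 << 20;
        int *buf, *out;
        cudaMalloc(&buf, n * sizeof(int));
        cudaMalloc(&out, n * sizeof(int));

        // Both kernels go on the default stream, so consumer does not start
        // until producer has finished and its writes are visible.
        producer<<<(n + 255) / 256, 256>>>(buf, n);
        consumer<<<(n + 255) / 256, 256>>>(buf, out, n);

        cudaDeviceSynchronize();        // explicit sync point for the host
        cudaFree(buf);
        cudaFree(out);
        return 0;
    }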

Upvotes: 3

Views: 822

Answers (1)

Booo

Reputation: 503

My understanding is that there are several ways to "synchronise" using CUDA:

  • CUDA Streams (at the function level): cudaDeviceSynchronize() synchronises across the whole device. In addition, you can synchronise a particular stream with cudaStreamSynchronize(cudaStream_t stream), or synchronise an event recorded in some stream with cudaEventSynchronize(cudaEvent_t event); see the stream/event sketch after this list. Ref 1, Ref 2.

  • Cooperative Groups (CUDA ≥ 9.0 and compute capability ≥ 3.0): you can synchronise at the group level; a group can be a set of coalesced threads, a thread block, or grids spanning multiple devices. This is much more flexible; see the cooperative-groups sketch after this list. Define your own group using

    (1) auto group = cooperative_groups::coalesced_threads() for the current coalesced set of threads, or

    (2) auto group = cooperative_groups::this_thread_block() for the current thread block; you can further define fine-grained groups within the block, such as auto group_warp = cooperative_groups::tiled_partition<32>(group), or

    (3) auto group = cooperative_groups::this_grid() for the current grid, or auto group = cooperative_groups::this_multi_grid() for a grid spanning multiple devices.

    Then, you can just call group.sync() for synchronisation. You need a device that supports cooperativeLaunch or cooperativeMultiDeviceLaunch, though. Note that even without cooperative groups you can perform the traditional block-level sync with __syncthreads() (combined with shared memory for data exchange within the block). Ref 1, Ref 2.
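As a sketch of the first bullet, here is a minimal example of the stream and event APIs; the kernel and variable names are my own, while the runtime calls are the standard ones named above:

    #include <cuda_runtime.h>

    __global__ void work(float *x) { x[threadIdx.x] *= 2.0f; }  // placeholder kernel

    int main() {
        float *a, *b;
        cudaMalloc(&a, 256 * sizeof(float));
        cudaMalloc(&b, 256 * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        cudaEvent_t done;
        cudaEventCreate(&done);

        work<<<1, 256, 0, s1>>>(a);   // runs on stream s1
        cudaEventRecord(done, s1);    // mark a point in s1

        // Make s2 wait on the device until s1 reaches the event,
        // without blocking the host:
        cudaStreamWaitEvent(s2, done, 0);
        work<<<1, 256, 0, s2>>>(b);   // now ordered after the s1 kernel

        cudaStreamSynchronize(s2);    // block the host until s2 drains
        // cudaEventSynchronize(done);  // alternative: wait on the event only
        // cudaDeviceSynchronize();     // heaviest option: wait on everything

        cudaEventDestroy(done);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(a);
        cudaFree(b);
        return 0;
    }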
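And a sketch of the second bullet; the kernel is illustrative only, and grid-wide this_grid().sync() additionally requires launching with cudaLaunchCooperativeKernel on a device that reports the cooperativeLaunch attribute:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void cg_demo(float *data) {
        cg::thread_block block = cg::this_thread_block();

        // Carve the block into warp-sized tiles (fine-grained groups).
        cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

        data[block.thread_rank()] += 1.0f;

        warp.sync();   // synchronise only the 32 threads of this tile
        block.sync();  // equivalent to the traditional __syncthreads()
    }

    // For grid-wide sync, the launch would look roughly like:
    //   void *args[] = { &data };
    //   cudaLaunchCooperativeKernel((void *)cg_demo, grid, block, args);
    // and the kernel would call cg::this_grid().sync().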

Upvotes: 2
