yyyy

Reputation: 57

What is the relationship between a GPU "wave" and a thread block?

computation performed by a GPU kernel is partitioned into groups of threads called thread blocks, which typically execute in concurrent groups, resulting in waves of execution

What exactly does "wave" mean here? Is it the same thing as a warp?

Upvotes: 1

Views: 1996

Answers (2)

Akif Aydogmus

Reputation: 53

  • Wave: a group of thread blocks running concurrently on the GPU.

  • Full Wave: (number of SMs on the device) x (max active blocks per SM)

Launching a grid with fewer thread blocks than a full wave results in low achieved occupancy. A launch usually consists of some number of full waves, possibly followed by one incomplete wave. Note that the size of a full wave depends on how many blocks fit on one SM, which is limited by resources such as registers per thread and shared memory per block.
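To make the resource dependence concrete, here is a minimal Python sketch of how the per-SM block count could be estimated. The `SmLimits` container, the function name, and the example kernel (64 registers per thread, no shared memory) are hypothetical; the limits are the commonly quoted Kepler-class values and should be checked against your device. It ignores allocation granularity, so on real hardware prefer the runtime's `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmLimits:
    """Per-SM resource limits (hypothetical container, not a CUDA API)."""
    max_threads: int
    max_blocks: int
    registers: int
    shared_mem_bytes: int


def max_active_blocks_per_sm(limits, threads_per_block, regs_per_thread, smem_per_block):
    """Estimate how many blocks can be resident on one SM at the same time.

    Simplification: real hardware allocates registers and shared memory in
    fixed-size chunks, so the true value can be lower than this estimate.
    """
    by_threads = limits.max_threads // threads_per_block
    by_registers = limits.registers // (regs_per_thread * threads_per_block)
    if smem_per_block > 0:
        by_smem = limits.shared_mem_bytes // smem_per_block
    else:
        by_smem = limits.max_blocks  # shared memory is not a constraint
    # The tightest of the four limits decides.
    return min(limits.max_blocks, by_threads, by_registers, by_smem)


# Assumed limits for a Kepler-class SM; verify against your own device.
KEPLER_LIKE = SmLimits(max_threads=2048, max_blocks=16,
                       registers=65536, shared_mem_bytes=48 * 1024)

# Hypothetical kernel: 256 threads per block, 64 registers per thread.
print(max_active_blocks_per_sm(KEPLER_LIKE, 256, 64, 0))  # 4
```

In this sketch the register file is the limiting resource: the thread limit alone would allow 8 blocks per SM.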

Using the values from Julien Demouth's blog post as an example:

  • max # of threads per SM: 2048 (NVIDIA Tesla K20)

  • the kernel runs 4 active blocks of 256 threads on each SM

  • Theoretical Occupancy: 50% (4*256/2048)

  • Full Wave: (# of SMs) x (max active blocks per SM) = 13x4 = 52 blocks

The kernel is launched with 128 blocks, so there are 2 full waves (104 blocks) and 1 incomplete wave of 24 blocks. The size of a full wave may be increased with the __launch_bounds__ qualifier, by configuring the amount of shared memory per SM (on some devices, see also the related report), etc.
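The arithmetic of this example can be checked with a short Python sketch. The function name `wave_breakdown` is made up for illustration; the inputs are the values quoted above (13 SMs, 4 active blocks per SM, a grid of 128 blocks).

```python
def wave_breakdown(num_sms, blocks_per_sm, grid_blocks):
    """Return (full wave size, number of full waves, blocks in the partial wave)."""
    wave_size = num_sms * blocks_per_sm
    full_waves, tail_blocks = divmod(grid_blocks, wave_size)
    return wave_size, full_waves, tail_blocks


# Values from the example: 13 SMs, 4 active blocks per SM, 128 blocks in the grid.
print(wave_breakdown(13, 4, 128))  # (52, 2, 24)
```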

The incomplete wave is also called the partial last wave, and it hurts performance because the GPU runs at low occupancy while it executes. This underutilization is called the tail effect, and it is most pronounced when the grid contains few thread blocks.
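A rough way to quantify the tail effect is to compare the blocks actually executed with the block slots available over all waves. The sketch below is a simplified model, not a profiler metric: it assumes every block takes the same time and that each wave lasts as long as a full one. The function name `average_utilization` is hypothetical.

```python
import math


def average_utilization(grid_blocks, wave_size):
    """Fraction of block slots doing useful work, averaged over all waves.

    Assumes equal block run times, so a partial last wave takes as long
    as a full wave while filling only part of the GPU.
    """
    waves = math.ceil(grid_blocks / wave_size)
    return grid_blocks / (waves * wave_size)


# 128 blocks on a 52-block wave: 3 waves, the last one only 24/52 full.
print(round(average_utilization(128, 52), 3))
# A much larger grid amortizes the partial wave over many full ones.
print(round(average_utilization(1000, 52), 3))
```

Under this model the 128-block launch uses about 82% of the available slots, while a grid of 1000 blocks uses about 96%, which is why the effect dominates for small grids.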

Upvotes: 2

Homer512

Reputation: 13295

A GPU can execute a maximum number of threads at once, grouped in a maximum number of thread blocks. When the whole grid for a kernel exceeds either of those limits, or if there are concurrent kernels occupying the GPU, the GPU launches as many thread blocks as it can fit. When the last thread of a block has terminated, a new block will start.

Since blocks typically have equal run times and scheduling has a certain latency, this often results in bursts of activity on the GPU that you can see in the occupancy. I believe this is what is meant by that sentence.
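The burst pattern can be reproduced with a toy scheduling model in Python. This is only an illustration under strong assumptions: a fixed number of block slots (52, borrowed from the other answer's example), identical block run times, and zero scheduling latency. The function name `block_start_times` is hypothetical and nothing here reflects how the hardware scheduler is actually implemented.

```python
import heapq
from collections import Counter


def block_start_times(grid_blocks, slots, block_time=1.0):
    """Greedy model: each block starts as soon as any slot becomes free."""
    free_at = [0.0] * slots  # time at which each slot becomes free
    heapq.heapify(free_at)
    starts = []
    for _ in range(grid_blocks):
        t = heapq.heappop(free_at)               # earliest free slot
        starts.append(t)
        heapq.heappush(free_at, t + block_time)  # slot is busy until the block ends
    return starts


# 128 blocks competing for 52 slots: the starts cluster into three bursts.
bursts = Counter(block_start_times(128, 52))
print(sorted(bursts.items()))  # [(0.0, 52), (1.0, 52), (2.0, 24)]
```

With unequal block run times the start times would spread out and the bursts would blur, which matches the "typically" in the explanation above.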

Do not confuse this with the term "wavefront", which is what AMD calls a warp.

Upvotes: 2
