stonestrong

Reputation: 347

How does an SM in CUDA run multiple blocks simultaneously?

In CUDA, can an SM run multiple blocks simultaneously if each block does not consume too many resources?

On Fermi, we know that an SM has 32 KB of register space. Suppose each thread uses 32 registers; then the SM can launch one block containing 256 ((32*1024)/(32*4)) threads. If an SM can run multiple blocks simultaneously, we could instead configure 32 threads per block and 8 blocks per SM. Is there any difference?

Upvotes: 1

Views: 1988

Answers (1)

Roger Dahl

Reputation: 15734

As @talonmies commented, your math is not entirely correct. But the key point is that an SM contains a balance of many different types of resources. The better your kernel and kernel launch parameters fit with this balance, the better your performance.

I haven't checked the numbers for Kepler (compute capability 3.x) but for Fermi (2.x), an SM can keep track of 48 concurrent warps (1,536 threads) and 8 concurrent blocks. This means that if you choose a low thread count for your blocks, the limit of 8 concurrent blocks becomes the limiting factor for occupancy in your kernel. For instance, if you choose 32 threads per block, you get up to 256 (8 * 32) concurrent threads running on the SM, while the SM can run up to 1,536 threads (48 * 32).
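To make the arithmetic above concrete, here is a minimal host-side C++ sketch. It uses only the two Fermi limits quoted in this answer (48 warps and 8 blocks per SM) and deliberately ignores registers and shared memory. The function name `resident_threads` is made up for illustration and is not part of any CUDA API.

```cpp
#include <algorithm>
#include <cassert>

// Fermi (compute capability 2.x) per-SM limits as quoted in the answer.
constexpr int kWarpSize = 32;
constexpr int kMaxWarpsPerSM = 48;
constexpr int kMaxBlocksPerSM = 8;

// Hypothetical helper: number of threads that can be resident on one SM
// when only the warp limit and the block limit are considered.
// Register and shared memory limits are NOT modelled here.
int resident_threads(int threads_per_block) {
    assert(threads_per_block > 0 &&
           threads_per_block <= kMaxWarpsPerSM * kWarpSize);
    // A partially filled warp still occupies a full warp slot.
    const int warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;
    // The SM holds as many blocks as both limits allow.
    const int blocks = std::min(kMaxBlocksPerSM, kMaxWarpsPerSM / warps_per_block);
    return blocks * threads_per_block;
}
```

Under these assumptions, `resident_threads(32)` gives 256, which is the block-limited case described above, while `resident_threads(192)` gives 1536 because 8 blocks of 6 warps fill all 48 warp slots.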

In the occupancy calculator, you can see what the different hardware limits are and it will tell you which of them becomes the limiting factor with your specific kernel. You can experiment with variations in launch parameters, shared memory usage and register usage to see how they affect your occupancy.
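As a rough illustration of what "limiting factor" means, the following C++ sketch takes a set of per-SM limits and a kernel configuration and reports which limit caps the number of resident blocks. This is not how the occupancy calculator is implemented: the struct and function names are hypothetical, the limits are supplied by the caller, and the sketch ignores register and shared memory allocation granularity, so the real tool can report lower numbers.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of one SM; fill in the values for your own GPU.
struct SmLimits {
    int max_blocks;        // concurrent blocks per SM
    int max_warps;         // concurrent warps per SM
    int registers;         // 32-bit registers per SM
    int shared_mem_bytes;  // shared memory per SM
};

// Hypothetical description of one kernel launch.
struct KernelConfig {
    int threads_per_block;
    int registers_per_thread;
    int shared_mem_per_block;  // bytes, 0 if unused
};

struct Occupancy {
    int blocks_per_sm;
    std::string limiter;  // name of the resource that caps blocks_per_sm
};

// Simplified estimate: allocation granularity is not modelled.
Occupancy estimate(const SmLimits& sm, const KernelConfig& k) {
    assert(k.threads_per_block > 0 && k.registers_per_thread > 0);
    const int warp_size = 32;
    const int warps_per_block = (k.threads_per_block + warp_size - 1) / warp_size;

    Occupancy result{sm.max_blocks, "blocks"};
    // Keep whichever limit allows the fewest resident blocks.
    auto consider = [&result](int blocks, const char* name) {
        if (blocks < result.blocks_per_sm) result = Occupancy{blocks, name};
    };
    consider(sm.max_warps / warps_per_block, "warps");
    // Registers are counted per whole warp.
    consider(sm.registers / (k.registers_per_thread * warps_per_block * warp_size),
             "registers");
    if (k.shared_mem_per_block > 0)
        consider(sm.shared_mem_bytes / k.shared_mem_per_block, "shared memory");
    return result;
}
```

For example, with `SmLimits{8, 48, 32768, 49152}` (the block and warp limits are the Fermi figures from this answer; the register and shared memory figures are my assumption for compute capability 2.x), a kernel with 32 threads per block and 32 registers per thread is limited by "blocks" at 8, whereas 256 threads per block with the same register usage is limited by "registers" at 4 blocks.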

Occupancy is not everything when it comes to performance. Increased occupancy translates to increased ability to hide the latency of memory transfers. When the memory bandwidth is saturated, increasing occupancy further does not help. There is another effect in play as well. Increasing the size of a block may decrease occupancy but at the same time increase the amount of instruction level parallelism (ILP) available in your kernel. In this case, decreasing occupancy can increase performance.

Upvotes: 3
