Reputation: 458
I'm reading "Professional CUDA C Programming" by Cheng et al., and it gives examples where a (very simple, single-line) kernel launched with <<<1024, 512>>> performs worse than one launched with <<<2048, 256>>>.
And then they state (several times) that you might have expected this result because the second run has more blocks and therefore exposes more parallelism. I can't figure out why, though. Isn't the amount of parallelism governed by the number of concurrent warps on the SM? What does block size have to do with that? It doesn't matter which block those warps belong to, the same block or different ones, so why would using smaller blocks expose more parallelism? (On the contrary, if the block size is too small I'd hit the max-blocks-per-SM limit, resulting in fewer concurrent warps.) The only scenario I can envision is blocks of 1024 threads = 32 warps on Fermi, which has a limit of 48 concurrent warps per SM. That allows only 1 resident block, and hence only 32 concurrent warps, reducing the amount of parallelism, but that's a very specific case.
UPDATE: Another thing I thought of after posting: a block cannot be evicted from the SM until all of its warps have finished. So at the end of a block's execution there can be a situation where a few "slowest" warps hold the entire block on the SM: most of the block's warps have finished and sit idle, but a new block cannot be loaded until those last few warps finish. In that case efficiency becomes low. With smaller blocks this still happens, but the ratio of idle to executing warps is smaller, so efficiency is higher. Is this it?
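For reference, here is a minimal sketch of the kind of comparison I mean. This is not the book's actual code: the addOne kernel, the element count, and the timing harness are my own stand-ins, just to show the two launch configurations being timed against each other with CUDA events.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical single-line kernel standing in for the book's example.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Time one launch of addOne with the given grid/block configuration.
static float timeLaunch(float *d, int n, int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    addOne<<<blocks, threads>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1024 * 512;  // one element per thread in both configurations
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    printf("<<<1024, 512>>>: %.3f ms\n", timeLaunch(d, n, 1024, 512));
    printf("<<<2048, 256>>>: %.3f ms\n", timeLaunch(d, n, 2048, 256));
    cudaFree(d);
    return 0;
}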
Upvotes: 8
Views: 1459
Reputation: 3438
Yes, this is it. The second paragraph in your question is a good answer.
In more detail, the number of warp schedulers inside each SM is limited (usually 2). Each warp scheduler keeps track of a number of active warps and schedules a warp for execution only if that warp can make forward progress in the program. The number of active warps a warp scheduler can track has a maximum (usually 32). Because the resources owned by a thread block (such as shared memory) cannot be released to a new thread block until all of its warps finish, a large block size can reduce the number of candidate active warps available to the scheduler when a few warps take a long time to finish. This can hurt performance, either through resource idleness or through the SM's inability to hide the latency of memory accesses. A bigger block size also increases the probability of warps blocking when synchronizing across the thread block with __syncthreads() or one of its variants, and can therefore lead to a similar phenomenon.
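You can check how block size limits resident warps on your own device with the occupancy API. A sketch, assuming a trivial kernel (addOne here is just a placeholder): cudaOccupancyMaxActiveBlocksPerMultiprocessor reports how many blocks of a given size can be resident on one SM at once, from which the resident warp count follows.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int sizes[] = {128, 256, 512, 1024};
    for (int blockSize : sizes) {
        int blocksPerSM = 0;
        // Ask the runtime how many blocks of this size fit on one SM at once.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, addOne, blockSize, 0 /* dynamic shared memory */);
        printf("blockSize %4d -> %d resident blocks = %d warps per SM\n",
               blockSize, blocksPerSM, blocksPerSM * blockSize / 32);
    }
    return 0;
}

If two block sizes yield the same total resident warps, the smaller one still finishes and replaces blocks in finer-grained chunks, which is exactly the tail effect described above.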
Upvotes: 6