Reputation: 18870
I understand that in CUDA, 32 adjacent threads in the same block are scheduled as a warp. But I frequently find tutorial CUDA code that launches multiple blocks with 1 thread per block. In this model, will 32 threads from 32 blocks be scheduled as a warp? If not, can I say this model is less efficient than organizing the threads into 32 threads per block? Thanks!
Upvotes: 5
Views: 328
Reputation: 453
One more point to add. Computation in CUDA always happens in warps, so even if you allocate fewer than 32 threads per block (1, 2, ..., 8, 16), the block still executes as a full warp (32 threads), and resources for 32 threads are tied up for that block.
If you allocate 32 blocks with one thread each, you are tying up resources for 32 x 32 threads. Avoid this if you can.
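To make the arithmetic concrete, here is a minimal host-side sketch in plain C++ (no GPU needed). `LaneUsage`, `lane_usage`, and `kWarpSize` are hypothetical names made up for this illustration, not part of the CUDA API, and the sketch assumes a warp size of 32 and a 1-D launch.

```cpp
#include <cassert>

// Assumed warp width: 32 lanes per warp.
const int kWarpSize = 32;

// Hypothetical helper type for this illustration only.
struct LaneUsage {
    int warps;           // warps needed for the whole grid
    int reserved_lanes;  // warps * kWarpSize
    int active_threads;  // threads that actually do work
};

// Hypothetical helper: lane accounting for a 1-D launch of
// `blocks` blocks with `threads_per_block` threads each.
inline LaneUsage lane_usage(int blocks, int threads_per_block) {
    assert(blocks > 0 && threads_per_block > 0);
    // A warp never spans blocks, so round up per block, then multiply.
    int warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;
    int warps = blocks * warps_per_block;
    return LaneUsage{warps, warps * kWarpSize, blocks * threads_per_block};
}
```

With these assumptions, `lane_usage(32, 1)` reports 32 warps and 1024 reserved lanes for 32 active threads, while `lane_usage(1, 32)` reports 1 warp and 32 reserved lanes for the same 32 threads.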
Upvotes: 0
Reputation: 151859
No, threads from different blocks cannot be scheduled in the same warp. If you create grids of threadblocks with only a single thread each, you're definitely not getting the full performance from the machine. It's less efficient than having 32 (or an integer multiple of 32) threads per block. A Fermi SM, for example, has 32 warp lanes that can be in use. If you are scheduling blocks of a single thread, then only 1 of those 32 lanes can be in use at any given time.
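Threads have a thread ID (the threadIdx built-in variable), which is defined within (and unique only within) a single block. Warps are built from threads with consecutive thread IDs inside one block, which is why threads from different blocks never share a warp.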
The Hardware Multithreading section of the CUDA C Programming Guide gives a formula for the total number of warps in a single block.
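As I recall it, that formula is ceil(T / Wsize, 1), where T is the number of threads per block, Wsize is the warp size (32), and ceil(x, y) rounds x up to the nearest multiple of y. A rough host-side sketch in plain C++ follows; `warps_per_block` is a hypothetical helper name for illustration, not a CUDA API function.

```cpp
#include <cassert>

// Hypothetical helper: threads per block divided by the warp size,
// rounded up to a whole number of warps (integer form of ceil(T / Wsize, 1)).
inline int warps_per_block(int threads_per_block, int warp_size = 32) {
    assert(threads_per_block > 0 && warp_size > 0);
    return (threads_per_block + warp_size - 1) / warp_size;
}
```

For example, `warps_per_block(1)` and `warps_per_block(32)` both give 1, `warps_per_block(33)` gives 2, and `warps_per_block(256)` gives 8. A block with 1 thread therefore costs the same single warp as a block with 32 threads.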
Upvotes: 6