einpoklum

Are threads in a multi-dimensional CUDA kernel block packed to fill warps?

NVIDIA GPUs schedule complete warps to execute instructions together (well, sort of; see also this question). Thus, if we have a "linear" block of, say, 90 threads (i.e. X x Y x Z = 90 x 1 x 1), a GPU core will have three warps to schedule instruction execution for:

[Figure: 90 consecutive threads split into three warps - two full warps of 32 threads and one partial warp of 26]

This is straightforward and obvious. But what happens if we have a multi-dimensional block whose X dimension is not a multiple of 32? Say, X x Y x Z = 30 x 3 x 1? There are at least two intuitive ways such a block could be broken up into warps.

Option 1 - pack threads into full warps:

[Figure: the 30 x 3 block's threads packed contiguously into warps of 32, 32, and 26, with warps straddling row boundaries]

Option 2 - keep threads with different z, y coordinates in separate warps:

[Figure: each row of 30 threads placed in its own warp, giving three warps with 2 unused lanes each]

The first option potentially requires fewer warps (think of a 16 x 2 x 1 block, which fits into a single full warp under option 1 but needs two half-empty warps under option 2); the second option is likely to prevent some divergence within warps - although this depends on the specifics of the kernel code.

My questions:

  1. If I don't try to specify anything about the aggregation into warps - which option is chosen by default? And does this differ by GPU/driver?
  2. Can I affect which of the two options is chosen, or otherwise affect the aggregation of threads into warps in a multidimensional block?


Answers (1)

einpoklum

tl;dr: CUDA packs full warps.

Deducing this from the programming guide

(Thanks @RobertCrovella)

Section 4.1 of the CUDA Programming Guide says:

The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy (§2.2) describes how thread IDs relate to thread indices in the block.

Section 2.2 of the CUDA Programming Guide says:

The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).

So, the multi-dimensional "thread index" is linearized in a straightforward manner into a one-dimensional "thread ID", and threads are simply packed into warps in order of that ID.
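
In code, that linearization looks like this (a minimal sketch; the helper names linear_thread_id, warp_index and lane_index are mine, not part of CUDA):

__device__ unsigned linear_thread_id()
{
    // Thread ID = x + y * Dx + z * Dx * Dy, per Section 2.2 of the guide
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

// With packed warps, consecutive thread IDs fill each warp in order:
__device__ unsigned warp_index() { return linear_thread_id() / warpSize; }
__device__ unsigned lane_index() { return linear_thread_id() % warpSize; }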

Seeing this for yourself

You can check the partitioning into warps using the following program:

#include <cstdio>

__global__ void test_kernel()
{
    // All lanes are active here, so every thread in a warp reports the
    // same mask - one bit per thread actually present in the warp.
    unsigned active_lanes = __activemask();
    printf("Thread (%2u,%2u): Active lane mask %8X\n",
        threadIdx.x, threadIdx.y, active_lanes);
}

int main()
{
    cudaSetDevice(0);
    dim3 block_dims{31, 2, 1}; // 62 threads - not a multiple of the warp size
    test_kernel<<<1, block_dims>>>();
    cudaDeviceSynchronize();
}

If warps are fully packed, the 62 threads will split into one full warp, whose threads report the full mask (0xFFFFFFFF), and one 30-thread warp, whose threads report a 30-lane mask (0x3FFFFFFF). If, instead, each row of 31 threads got its own warp, all 62 threads would report a 31-lane mask (0x7FFFFFFF).

... and indeed, we get the first option.

"But I want option 2!"

Well, if you want separate warps for different Y and Z coordinates, what you can do is "pad" your block dimensions so that the first (X-axis) dimension is always a multiple of the warp size, 32. This comes, of course, at the cost of an extra check:

if (threadIdx.x >= unpadded_x_block_size) { return; }

but that's not very expensive (especially if you use threadIdx.x elsewhere anyway, and if you can calculate unpadded_x_block_size at compile time).
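
Here's a minimal sketch of this padding approach (the kernel name and the unpadded_x_block_size constant are hypothetical, standing in for your own kernel and its true X dimension):

constexpr unsigned unpadded_x_block_size = 30; // the X size you actually want

__global__ void padded_kernel()
{
    // Threads in the padding region exit immediately; every (y, z) row
    // now starts at a warp boundary, so rows never share a warp.
    if (threadIdx.x >= unpadded_x_block_size) { return; }
    // ... the actual work for the 30 "real" threads per row ...
}

int main()
{
    constexpr unsigned warp_size = 32;
    // Round the X dimension up to the next multiple of the warp size: 30 -> 32
    constexpr unsigned padded_x =
        ((unpadded_x_block_size + warp_size - 1) / warp_size) * warp_size;
    dim3 block_dims{padded_x, 3, 1};
    padded_kernel<<<1, block_dims>>>();
    cudaDeviceSynchronize();
}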

