einpoklum

Are threads in a multi-dimensional CUDA kernel block packed to fill warps?

NVIDIA GPUs schedule complete warps to execute instructions together (well, sort of; see also this question). Thus, if we have a "linear" block of, say, 90 threads (i.e. X x Y x Z = 90 x 1 x 1), a GPU core will have three warps to schedule instruction execution for:

[Figure: 90 consecutive threads split into three warps - two full warps of 32 threads and one partial warp of 26]

This is straightforward and obvious. But what happens if we have a multi-dimensional block whose X dimension is not a multiple of 32? Say, X x Y x Z = 30 x 3 x 1? There are at least two intuitive ways such a block could be broken up into warps.

Option 1 - pack threads into full warps:

[Figure: the 30 x 3 block's threads packed contiguously into warps of 32, 32, and 26, with warps straddling row boundaries]

Option 2 - keep threads with different z, y coordinates in separate warps:

[Figure: each row of 30 threads placed in its own warp, giving three warps with 2 unused lanes each]

The first option potentially requires fewer warps (think of a 16 x 2 x 1 block, which fits into a single full warp under option 1 but needs two half-empty warps under option 2); the second option is likely to prevent some divergence within warps - although this depends on the specifics of the kernel code.

My questions:

  1. If I don't try to specify anything about the aggregation into warps - which option is chosen by default? And does this differ by GPU/driver?
  2. Can I affect which of the two options is chosen, or otherwise affect the aggregation of threads into warps in a multidimensional block?


Answers (1)

einpoklum

tl;dr: CUDA packs full warps.

Deducing this from the programming guide

(Thanks @RobertCrovella)

Section 4.1 of the CUDA Programming Guide says:

The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy (§2.2) describes how thread IDs relate to thread indices in the block.

Section 2.2 of the CUDA Programming Guide says:

The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).

So, the multi-dimensional "thread index" is linearized in a straightforward manner into a one-dimensional "thread ID", and threads are simply packed into warps in order of that ID.
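
In code, that linearization looks like this (a minimal sketch; the helper names linear_thread_id, warp_index and lane_index are mine, not part of CUDA):

__device__ unsigned linear_thread_id()
{
    // Thread ID = x + y * Dx + z * Dx * Dy, per Section 2.2 of the guide
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

// With packed warps, consecutive thread IDs fill each warp in order:
__device__ unsigned warp_index() { return linear_thread_id() / warpSize; }
__device__ unsigned lane_index() { return linear_thread_id() % warpSize; }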

Seeing this for yourself

You can check the partitioning into warps using the following program:

#include <cstdio>

__global__ void test_kernel()
{
    // All lanes are active here, so every thread in a warp reports the
    // same mask - one bit per thread actually present in the warp.
    unsigned active_lanes = __activemask();
    printf("Thread (%2u,%2u): Active lane mask %8X\n",
        threadIdx.x, threadIdx.y, active_lanes);
}

int main()
{
    cudaSetDevice(0);
    dim3 block_dims{31, 2, 1}; // 62 threads - not a multiple of the warp size
    test_kernel<<<1, block_dims>>>();
    cudaDeviceSynchronize();
}

If warps are fully packed, the 62 threads will split into one full warp, whose threads report the full mask (0xFFFFFFFF), and one 30-thread warp, whose threads report a 30-lane mask (0x3FFFFFFF). If, instead, each row of 31 threads got its own warp, all 62 threads would report a 31-lane mask (0x7FFFFFFF).

... and indeed, we get the first option.

"But I want option 2!"

Well, if you want separate warps for different Y and Z coordinates, what you can do is "pad" your block dimensions so that the first (X-axis) dimension is always a multiple of the warp size, 32. This comes, of course, at the cost of an extra check:

if (threadIdx.x >= unpadded_x_block_size) { return; }

but that's not very expensive (especially if you use threadIdx.x elsewhere anyway, and if you can calculate unpadded_x_block_size at compile time).
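
Here's a minimal sketch of this padding approach (the kernel name and the unpadded_x_block_size constant are hypothetical, standing in for your own kernel and its true X dimension):

constexpr unsigned unpadded_x_block_size = 30; // the X size you actually want

__global__ void padded_kernel()
{
    // Threads in the padding region exit immediately; every (y, z) row
    // now starts at a warp boundary, so rows never share a warp.
    if (threadIdx.x >= unpadded_x_block_size) { return; }
    // ... the actual work for the 30 "real" threads per row ...
}

int main()
{
    constexpr unsigned warp_size = 32;
    // Round the X dimension up to the next multiple of the warp size: 30 -> 32
    constexpr unsigned padded_x =
        ((unpadded_x_block_size + warp_size - 1) / warp_size) * warp_size;
    dim3 block_dims{padded_x, 3, 1};
    padded_kernel<<<1, block_dims>>>();
    cudaDeviceSynchronize();
}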

