Reputation: 132108
NVIDIA GPUs schedule complete warps to execute instructions together (well, sort of; see also this question). Thus, if we have a "linear" block of, say, 90 threads (i.e. X x Y x Z = 90 x 1 x 1) - a GPU core will have three warps to schedule instruction execution for:

- Warp 0: threads 0 .. 31
- Warp 1: threads 32 .. 63
- Warp 2: threads 64 .. 89
This is straightforward and obvious. But - what happens if we have a multi-dimensional block whose X dimension is not a multiple of 32? Say, X x Y x Z = 30 x 3 x 1? There are at least two intuitive ways this could be broken up into warps.
Option 1 - pack threads into full warps:

- Warp 0: (0,0,0) .. (29,0,0), (0,1,0), (1,1,0) - 32 threads
- Warp 1: (2,1,0) .. (29,1,0), (0,2,0) .. (3,2,0) - 32 threads
- Warp 2: (4,2,0) .. (29,2,0) - 26 threads
Option 2 - keep threads with different z, y coordinates in separate warps:

- Warp 0: (0,0,0) .. (29,0,0) - 30 threads
- Warp 1: (0,1,0) .. (29,1,0) - 30 threads
- Warp 2: (0,2,0) .. (29,2,0) - 30 threads
The first option potentially requires fewer warps (think of the case of a 16 x 2 x 1 block); the second option is likely to prevent some divergence within warps - although this depends on the specifics of the kernel code.
My questions:

- Which of these options (if either) does CUDA actually use when partitioning a block into warps?
- Is this behavior documented anywhere?
- If I want the second option, is there anything I can do about it?
Upvotes: 1
Views: 685
Reputation: 132108
(Thanks @RobertCrovella)
Section §4.1 of the CUDA Programming Guide says:
The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy (§2.2) describes how thread IDs relate to thread indices in the block.
Section §2.2 of the CUDA Programming Guide says:
The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).
So, the multi-dimensional "thread index" is linearized in a straightforward manner into a unidimensional "thread ID", and those are just packed into warps in-order.
You can check the partition-into-warps using the following program:
#include <cstdio>

__global__ void test_kernel()
{
    unsigned active_lanes = __activemask();
    printf("Thread (%2u,%2u): Active lane mask %8X\n",
        threadIdx.x, threadIdx.y, active_lanes);
}

int main()
{
    cudaSetDevice(0);
    test_kernel<<<1, {31, 2, 1}>>>();
    cudaDeviceSynchronize();
}
If warps are fully packed, you'll see one warp's worth of threads reporting the full mask (0xFFFFFFFF), and 30 threads reporting a 30-lane mask (0x3FFFFFFF). Otherwise, the pattern will be different.
... and indeed, we get the first option.
Well, if you want separate warps for different Y and Z axis coordinates, one option is to "pad" your block dimensions so that the first (X-axis) dimension is always a multiple of the warp size, 32. This has, of course, the cost of having to perform an extra check:
if (threadIdx.x >= unpadded_x_block_size) { return; }
but that's not very expensive (especially if you use threadIdx.x elsewhere anyway, and if you can calculate unpadded_x_block_size at kernel compile time).
Upvotes: 1