Warp and block scheduling in CUDA - what exactly happens, and questions about eligible warps

Question

I understand how warps and blocks are scheduled in CUDA - but not how these two scheduling arrangements come together. I know that once there are enough execution resources in an SM to support a new block, a new block is executed, and I know that eligible warps are selected for execution every clock cycle (if spare execution resources allow). However, what exactly makes a warp "eligible"? And what if there are enough execution resources to support a new warp - but not a new block? Does the block scheduling include warp scheduling? Help will be highly appreciated, thanks!

Robert Crovella · Accepted Answer

Does the block scheduling include warp scheduling?

The block scheduler and the warp scheduler should be thought of as two separate entities. In fact, I would view the block scheduler as a device-wide entity, whereas the warp scheduler is a per-SM entity.

You can imagine that there is a "queue" of blocks associated with each kernel launch. As resources on an SM become available, the block scheduler deposits a block from the "queue" onto that SM.

With that description, block scheduling does not include warp scheduling.
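One way to observe block placement is to have each block record the SM it ran on. The following is a minimal sketch, not a definitive tool: the helper names `current_sm_id`, `record_sm`, and `block_to_sm_map` are hypothetical, and it relies on the PTX special register `%smid`, whose numbering is not guaranteed to be contiguous.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Returns the ID of the SM executing the calling thread, read from the
// PTX special register %smid.
__device__ unsigned int current_sm_id() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One thread per block records which SM its block was placed on.
__global__ void record_sm(unsigned int* sm_of_block) {
    if (threadIdx.x == 0) {
        sm_of_block[blockIdx.x] = current_sm_id();
    }
}

// Hypothetical helper: launches num_blocks blocks and returns, for each
// block index, the SM it ran on. Returns an empty vector on any CUDA error.
std::vector<unsigned int> block_to_sm_map(int num_blocks, int threads_per_block) {
    std::vector<unsigned int> host(num_blocks, 0u);
    unsigned int* dev = nullptr;
    const size_t bytes = host.size() * sizeof(unsigned int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return {};

    record_sm<<<num_blocks, threads_per_block>>>(dev);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying.
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    if (err != cudaSuccess) return {};
    return host;
}
```

Calling `block_to_sm_map(64, 128)` and printing the result should show which SM each of the 64 blocks landed on. The mapping can change from run to run, since the order in which the block scheduler deposits blocks is not specified.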

However, what exactly makes a warp "eligible"?

We're now considering a block that has already been deposited on an SM. A warp is "eligible" when it has one or more instructions that are ready to be executed. The opposite of "eligible" is "stalled": a warp is "stalled" when it has no instructions that are ready to be executed. The GPU profiler documentation describes a variety of possible "stall reasons"(*), but a typical one is a dependency: an instruction that depends on the result of a previous instruction (or operation, such as a memory read) is not eligible to be issued until that result is ready.

Also note that the GPU is currently not an out-of-order machine. If the next instructions to be executed are stalled, the GPU does not search (very far) into the subsequent instruction stream for instructions that could be executed independently.
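The dependency case can be illustrated with two kernels that do the same number of multiply-adds. This is a sketch under stated assumptions: the kernel names and the `time_single_warp_ms` helper are hypothetical, and the compiler is free to reorder or unroll the loops. In the first kernel every multiply-add needs the result of the previous one. In the second, four chains are independent, so the next instruction in program order is more often ready to issue. A single warp is launched so that no other warps are available to hide the latency.

```cuda
#include <cuda_runtime.h>

// Every iteration depends on the result of the previous one.
__global__ void dependent_chain(float* out, float x, int iters) {
    float a = x;
    for (int i = 0; i < iters; ++i) {
        a = a * 0.999f + 0.5f;
    }
    out[threadIdx.x] = a;
}

// Same total number of multiply-adds, split over four independent chains.
__global__ void independent_chains(float* out, float x, int iters) {
    float a = x, b = x + 1.0f, c = x + 2.0f, d = x + 3.0f;
    for (int i = 0; i < iters / 4; ++i) {
        a = a * 0.999f + 0.5f;
        b = b * 0.999f + 0.5f;
        c = c * 0.999f + 0.5f;
        d = d * 0.999f + 0.5f;
    }
    out[threadIdx.x] = a + b + c + d;
}

// Hypothetical helper: times one launch of a single warp (1 block, 32 threads).
// Returns elapsed milliseconds, or -1.0f on error.
float time_single_warp_ms(bool independent, int iters) {
    float* out = nullptr;
    if (cudaMalloc(&out, 32 * sizeof(float)) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) { cudaFree(out); return -1.0f; }
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        cudaFree(out);
        return -1.0f;
    }

    float ms = -1.0f;
    cudaEventRecord(start);
    if (independent) {
        independent_chains<<<1, 32>>>(out, 1.0f, iters);
    } else {
        dependent_chain<<<1, 32>>>(out, 1.0f, iters);
    }
    cudaEventRecord(stop);
    if (cudaEventSynchronize(stop) == cudaSuccess) {
        cudaEventElapsedTime(&ms, start, stop);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return ms;
}
```

Comparing `time_single_warp_ms(false, 1 << 20)` with `time_single_warp_ms(true, 1 << 20)` may show the independent version finishing sooner, but the size of the difference depends on the hardware and compiler. With many resident warps the gap usually shrinks, because the warp scheduler can issue from other eligible warps while one warp waits on a dependency.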

And what if there are enough execution resources to support a new warp - but not a new block?

That doesn't provide anything useful. To schedule a new block (i.e. for the block scheduler to deposit a new block on an SM), there must be enough resources available for the entire block. The block scheduler does not deposit blocks warp-by-warp; it is an all-or-nothing proposition, on a block-by-block basis.
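The block-granular nature of residency is visible in the CUDA runtime occupancy API, which reports a whole number of blocks per SM. A minimal sketch follows. The kernel `scale_via_shared` and the helper `resident_blocks_per_sm` are hypothetical names chosen for illustration, while `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is the actual runtime call.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel that uses dynamically sized shared memory, so the
// per-block resource footprint can be varied from the host.
__global__ void scale_via_shared(float* data, float factor, int n) {
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i] * factor;  // each thread uses only its own slot
        data[i] = tile[threadIdx.x];
    }
}

// Hypothetical helper: how many whole blocks of this kernel can be resident
// on one SM for the given block size and dynamic shared memory per block.
// Returns -1 on error.
int resident_blocks_per_sm(int block_size, size_t dynamic_smem_bytes) {
    int blocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks, scale_via_shared, block_size, dynamic_smem_bytes);
    return err == cudaSuccess ? blocks : -1;
}
```

For example, comparing `resident_blocks_per_sm(256, 0)` with `resident_blocks_per_sm(256, 32 * 1024)` should show the count dropping in whole-block steps as the per-block shared memory grows. Any leftover registers or shared memory that could hold a few more warps, but not a complete block, simply stays unused.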

(*) There is one "stall reason" called "not selected", which does not actually indicate that the warp is stalled. It means that the warp is in fact eligible but was not selected for instruction dispatch on that cycle, usually because the warp scheduler(s) chose instruction(s) from other warp(s) to issue in that cycle.
