Psypher

Reputation: 436

CUDA: GPU Working

I have a basic question for my understanding. I apologize if the answer is covered somewhere in the documentation; I couldn't find anything related to this in the CUDA C Programming Guide.

I have a Fermi architecture GPU, a GeForce GTX 470. It has:

14 streaming multiprocessors
32 stream cores per SM

I wanted to understand the thread pre-emption mechanism with an example. Suppose I have the simplest possible kernel with a 'printf' statement (printing out the thread ID), and I use the following dimensions for the grid and blocks (a complete sketch follows the dimension settings below):

dim3 grid, block;
grid.x = 14;
grid.y = 1;
grid.z = 1;

block.x = 32;
block.y = 1;
block.z = 1;
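For concreteness, here is a minimal sketch of what I mean (the kernel is just an illustration; device-side printf needs compute capability 2.0+, which the GTX 470 has):

#include <cstdio>

// Each thread prints its global thread ID.
__global__ void printTid()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("thread %d\n", tid);
}

int main()
{
    dim3 grid(14, 1, 1);   // 14 blocks
    dim3 block(32, 1, 1);  // 32 threads per block (exactly one warp)

    printTid<<<grid, block>>>();
    cudaDeviceSynchronize();  // wait for the kernel and flush device printf
    return 0;
}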

So as I understand it, the 14 blocks will be scheduled to the 14 streaming multiprocessors. And as each streaming multiprocessor has 32 cores, each core will execute one thread of the kernel. Is this correct?

If this is correct, then what happens in the following case?

grid.x = 14;
grid.y = 1;
grid.z = 1;

block.x = 64;
block.y = 1;
block.z = 1;

I understand that however many blocks I assign to the grid, they will be scheduled without any particular sequence or predictability. That is because, as soon as a resource bottleneck is encountered, the GPU will schedule those blocks which do not require those resources.

1) Is the same criterion used for scheduling threads?

2) But, like I mentioned, I have a printf statement and no common resource usage; what happens in that case? After the first 32 threads are executed, are the remaining 32 threads executed?

3) If I also have a y-dimension in the block, what is the sequence then? Are the first 32 threads in the x-dimension done for every y, and then the rest?

Can someone please comment on this?

Upvotes: 0

Views: 143

Answers (1)

Roger Dahl

Reputation: 15734

So as I understand it, the 14 blocks will be scheduled to the 14 streaming multiprocessors.

Not necessarily. A single block with 32 threads is not enough to saturate an SM, so multiple blocks may be scheduled onto a single SM while other SMs go unused. As you increase the number of blocks, you will get to a point where they are evenly distributed over all SMs.
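As a side note, you can query how much one SM can hold through the CUDA runtime; a minimal sketch (assuming device 0 is your GTX 470):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // On a GTX 470 this should report 14 SMs and a per-SM thread limit far
    // above the 32 threads a single one-warp block supplies.
    printf("%s: %d SMs, warp size %d, max %d threads per SM\n",
           prop.name, prop.multiProcessorCount, prop.warpSize,
           prop.maxThreadsPerMultiProcessor);
    return 0;
}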

And as each streaming multiprocessor has 32 cores, each core will execute one thread of the kernel.

The CUDA cores are heavily pipelined, so each core processes many threads at the same time, with each thread at a different stage of the pipeline. An SM also contains varying numbers of several other types of execution units.

Taking a closer look at the Fermi SM (see the diagram below), you see the 32 CUDA cores (marketing speak for ALUs), each of which can hold around 20 threads in its pipeline. But there are only 16 LD/ST (load/store) units and only 4 SFU (special function) units. So, when a warp gets to an instruction that is not supported by the ALUs, the warp will be scheduled multiple times. For instance, if the instruction requires the SFU units, the warp will be scheduled 8 (32 / 4) times.
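For instance, a kernel like this sketch (my own example) would exercise the SFUs, since __sinf is the hardware fast-math sine intrinsic:

// Each warp executing __sinf issues to the 4 SFUs, so the scheduler
// replays the warp 32 / 4 = 8 times to cover all 32 threads.
__global__ void sfuExample(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = __sinf(tid * 0.1f);  // serviced by the SFUs, not the ALUs
}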

I understand that however many blocks I assign to the grid, they will be scheduled without any particular sequence or predictability. That is because, as soon as a resource bottleneck is encountered, the GPU will schedule those blocks which do not require those resources.

1) Is the same criterion used for scheduling threads?

Because the CUDA architecture guarantees that all threads in a block will have access to the same shared memory, a block can never move between SMs. When the first warp for a block has been scheduled on a given SM, all other warps in that block will be run on that same SM regardless of resources becoming available on other SMs.
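You can observe this placement yourself by reading the SM ID from the PTX %smid special register; a sketch (inline PTX, needs sm_20 or later):

#include <cstdio>

// Returns the ID of the SM the calling thread is running on.
__device__ unsigned int smid()
{
    unsigned int id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One thread per block reports the block's SM. Every warp of a given
// block prints the same SM ID, because blocks never migrate.
__global__ void whereAmI()
{
    if (threadIdx.x == 0)
        printf("block %d runs on SM %u\n", blockIdx.x, smid());
}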

2) But, like I mentioned, I have a printf statement and no common resource usage; what happens in that case? After the first 32 threads are executed, are the remaining 32 threads executed?

Think of blocks as sets of warps that are guaranteed to run on the same SM. So, in your example, the 64 threads (2 warps) of each block will be executed on the same SM. On the first clock, the first instruction of one warp is scheduled. On the second clock, that instruction has moved one step into the pipelines, so the resource that was used is free to accept either the second instruction from the same warp or the first instruction from the second warp. Since there are around 20 stages in the ALU pipelines on Fermi, 2 warps will not contain enough explicit (thread-level) parallelism to fill all the stages, and they will probably not contain enough instruction-level parallelism (ILP) to do so either.

3) If I also have a y-dimension in the block, what is the sequence then? Are the first 32 threads in the x-dimension done for every y, and then the rest?

The dimensions are only to enable offloading of generation of 2D and 3D thread indexes to dedicated hardware. The schedulers see the blocks as a 1D array of warps. The order in which they search for eligible warps is undefined. The scheduler will search in a fairly small set of "active" warps for a warp that has a current instruction that needs a resource that is currently open. When a warp is complete, a new one will be added to the active set. So, the order in which the warps are completed becomes unpredictable.
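The x-major numbering is the one documented in the CUDA C Programming Guide; as a sketch:

// Threads are numbered x first, then y, then z. Consecutive groups of
// 32 linear IDs form a warp.
__device__ int linearThreadId()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

__device__ int warpId()
{
    return linearThreadId() / warpSize;  // warpSize is 32 on Fermi
}

So with a block of (32, 2, 1), the first warp is the row y = 0 and the second warp is the row y = 1, but the order in which the two warps finish is still unpredictable.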

Fermi SM:

[Figure: Fermi SM block diagram]

Upvotes: 3
