Gabriel

Reputation: 9432

GPU Utilization Interpretation

I have tested a kernel with the NVIDIA Profiler which outputs the following:


We have launched the kernel with 256 blocks and 256 threads. As far as I understand, the figure shows three sections, one for Warps, one for Registers, and one for Shared Memory, and each section has a calculated "Block Limit", of which the one in the Registers section is the smallest and therefore the limiting value. Obviously the kernel is register-bound and we can launch only 4 blocks simultaneously on one SM. That is what the Profiler says. What I keep asking myself is the following:

A GTX 780 Ti has 192 cores per SM, so how is it possible that 4 blocks * 256 threads = 1024 threads can be launched simultaneously? What does "simultaneously" mean in CUDA terms anyway? Does it mean that 4 blocks can be held by the scheduler at the same time while the SM executes instructions in lock-step fashion from the warps of a single block at a time? The word "simultaneously" is somewhat confusing.

Thanks a lot

Upvotes: 1

Views: 1407

Answers (1)

Robert Crovella

Reputation: 151849

The GPU is a latency-hiding machine, and the latency-hiding involves scheduling various threads (instructions) on various execution units, at each cycle/issue slot. In order to hide latency best, the GPU likes to have many more available threads to choose instructions from than there are execution units on the SM.

So in a given cycle, instructions may be scheduled by the warp scheduler(s) onto only 192 (or, more likely, fewer) execution units, but in the very next cycle/issue slot, more instructions can be scheduled. To facilitate this process, we want to have as many threads/warps/blocks "available" for scheduling as possible. The "simultaneous" here refers to the number of threads/warps/blocks that are "open" on the SM and "available to be scheduled". It does not refer to how many actual instructions get issued in any issue slot, nor does it refer to the number of "cores" or "execution units" on the SM.

The number of threads/warps/blocks that can be "opened" on the SM at any given time (so as to be available for scheduling purposes) may be limited by the resource usage of the threadblocks in question. Threadblocks with high register usage, for example, may limit the total number of threadblocks that can be "open" on the SM, because the SM must allocate a full register set to every "open" threadblock.

Since GTX 780 Ti uses the GK110 GPU, the GK110 white paper may be of interest.

Upvotes: 5
