Why does each thread have its own instruction address counter inside a warp?

Question

Warps in CUDA always include 32 threads, and all of these 32 threads run the same instruction when the warp is running in SM. The previous question also says each thread has its own instruction counter as quoted below.

Then why does each thread need its own instruction address counter if all the 32 threads always execute the same instruction, could the threads inside 1 warp just share an instruction address counter?

Each thread has its own instruction address counter and register state, and carries out the current instruction on its own data

Robert Crovella · Accepted Answer

I'm not able to respond directly to the quoted text, because I don't have the book it comes from, nor do I know the authors intent.

However, an independent program counter per thread is considered to be a new feature in Volta, see figure 21 and caption in the volta whitepaper:

Volta maintains per-thread scheduling resources such as program counter (PC) and call stack (S), while earlier architectures maintained these resources per warp.

The same whitepaper probably does about as good a job as you will find of why this is needed in Volta, and presumably it carries forward to newer architectures such as Turing:

Volta’s independent thread scheduling allows the GPU to yield execution of any thread, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. To maximize parallel efficiency, Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity, while the convergence optimizer in Volta will still group together threads which are executing the same code and run them in parallel for maximum efficiency

Because of this, a Volta warp could have any number of subgroups of threads (up to the warp size, 32), which could be at different places in the instruction stream. The Volta designers decided that the best way to support this flexibility was to provide (among other things) a separate PC per thread in the warp.

Why does each thread have its own instruction address counter inside a warp?

Answers (1)

Related Questions