Reputation: 1137
There are many posts about how CUDA threads and blocks get mapped to GPU hardware, but I cannot find a clear answer to this question. What are the rules by which warps are partitioned across cores, if any?
I know that each multiprocessor, which contains some number of cores, receives one or more thread blocks to process. These blocks are partitioned into warps of 32 threads each and then deployed to the cores, but what are the rules by which warps are mapped to cores? Is it always one warp per core, or something else? Can a core process several warps? Fractions of a warp?
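For reference, here is my mental model of the block-to-warp split as a runnable sketch. The linearized-thread-index rule (x fastest, then y, then z, grouped into runs of 32) is what the programming guide describes; the kernel name and launch shape are just illustrative:

```cuda
// Sketch: how a block's threads are partitioned into warps.
// CUDA linearizes threadIdx (x fastest, then y, then z) and groups
// runs of warpSize (32) consecutive indices into warps.
#include <cstdio>

__global__ void whichWarp() {
    int linearTid = threadIdx.x
                  + threadIdx.y * blockDim.x
                  + threadIdx.z * blockDim.x * blockDim.y;
    int warpId = linearTid / warpSize;  // which warp within this block
    int laneId = linearTid % warpSize;  // position within that warp
    if (laneId == 0)
        printf("block %d: warp %d starts at linear thread %d\n",
               blockIdx.x, warpId, linearTid);
}

int main() {
    whichWarp<<<2, 128>>>();  // 2 blocks of 128 threads = 4 warps per block
    cudaDeviceSynchronize();
    return 0;
}
```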
Upvotes: 4
Views: 827
Reputation: 11509
CUDA cores are integer/floating-point math pipelines, so the partitioning implied by the term "core" is misleading. Each SM has 1-4 warp schedulers. Each warp scheduler has a fixed number of dispatch units. Each dispatch unit can dispatch to specific pipelines, which include CUDA cores (int/fp), double-precision units, load/store units, branch units, special function units, and texture units. The pipelines can have different widths, which can be inferred from the documented instruction throughput. All threads in a warp are issued to the same pipeline, and the instruction may be issued over multiple cycles.
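Note that none of this scheduler/dispatch structure is visible through the runtime API, which is part of why the "core" framing misleads. A small sketch of what you can and cannot query (device 0 and the printed fields are just illustrative):

```cuda
// Sketch: the runtime API exposes the SM count, warp size, and
// resident-thread limits, but NOT the per-SM scheduler, dispatch-unit,
// or CUDA-core counts -- those come from the architecture whitepapers.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs:                %d\n",    prop.multiProcessorCount);
    printf("warp size:          %d\n",    prop.warpSize);
    printf("max threads per SM: %d\n",    prop.maxThreadsPerMultiProcessor);
    printf("compute capability: %d.%d\n", prop.major, prop.minor);
    // cudaDeviceProp has no field for warp schedulers or cores per SM;
    // those must be looked up per compute capability.
    return 0;
}
```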
The GPU pipelines are fairly deep. Only one warp can occupy a given stage of a specific pipeline at a time; however, multiple warps may be in flight in the same pipeline. For example, warp 1 may be in the ALU.execute stage while warp 2 is in the ALU.write_back stage.
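As an illustrative sketch of that pipelining (the kernel, the FMA constant, and the launch shapes are my own, not anything from the hardware docs): a single warp running dependent FMAs stalls while each result is in flight, whereas many resident warps let the schedulers interleave issues. On most GPUs the two timings below come out far closer than the 32x difference in work would suggest, which is the latency hiding the pipeline depth makes necessary.

```cuda
// Sketch: dependent FMAs on one SM, timed with 1 warp vs. 32 warps.
// With 1 warp the pipeline sits mostly idle between dependent issues;
// with 32 warps the schedulers interleave them to fill the stages.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dependentFMAs(float *out, int iters) {
    float x = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i)
        x = x * 1.0000001f + 0.5f;  // each FMA depends on the previous one
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep the loop live
}

static float timeLaunch(float *out, int threads, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    dependentFMAs<<<1, threads>>>(out, iters);  // single block -> single SM
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    float *out;
    cudaMalloc(&out, 1024 * sizeof(float));
    int iters = 1 << 20;
    printf("1 warp:   %.2f ms\n", timeLaunch(out, 32,   iters));
    printf("32 warps: %.2f ms\n", timeLaunch(out, 1024, iters));
    cudaFree(out);
    return 0;
}
```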
Upvotes: 5