Tarek

Reputation: 1090

Optimal number of CUDA parallel blocks

Can there be any performance advantage to launching a grid of blocks simultaneously, rather than launching blocks one at a time, if the number of threads in each block already exceeds the number of CUDA cores?

Upvotes: 2

Views: 2481

Answers (2)

Greg Smith

Reputation: 11509

Launch Latency

The launch latency of a grid (from the API call until work starts on the GPU) is 3-8 µs on Linux and 30-80 µs on Windows Vista/Win7.

Distributing a block to an SM takes on the order of 10s to 100s of nanoseconds.

Launching a warp in a block (32 threads) takes a few cycles and happens in parallel on each SM.
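To make the latency comparison concrete, here is a minimal microbenchmark sketch (the kernel name, sizes, and trivial per-thread work are illustrative, not from the answer). It times one launch of a 256-block grid against 256 single-block launches of the same total work, so the per-launch overhead described above shows up directly:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f + 1.0f;   // trivial per-thread work
}

int main()
{
    const int blocks = 256, threads = 256;
    float *d;
    cudaMalloc(&d, blocks * threads * sizeof(float));

    // Warm-up launch so first-launch overhead doesn't skew the timing.
    busyKernel<<<blocks, threads>>>(d);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // One launch of the whole grid: pays the API launch latency once.
    cudaEventRecord(start);
    busyKernel<<<blocks, threads>>>(d);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("one grid of %d blocks:      %.3f ms\n", blocks, ms);

    // N launches of one block each: pays the launch latency N times.
    cudaEventRecord(start);
    for (int b = 0; b < blocks; ++b)
        busyKernel<<<1, threads>>>(d + b * threads);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d single-block launches: %.3f ms\n", blocks, ms);

    cudaFree(d);
    return 0;
}
```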

Resource Limitations

Concurrent kernels:

- Tesla: N/A, only 1 grid at a time
- Fermi: 16 grids at a time
- Kepler: 16 grids (Kepler2: 32 grids)

Maximum blocks (not considering occupancy limitations):

- Tesla: SmCount * 8 (gtx280 = 30 * 8 = 240)
- Fermi: SmCount * 16 (gf100 = 16 * 16 = 256)
- Kepler: SmCount * 16 (gk104 = 8 * 16 = 128)

See the occupancy calculator for the limits on threads per block, threads per SM, registers per SM, registers per thread, and so on.
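For reference, the SM count and per-device limits used in the arithmetic above can also be queried at runtime. A minimal sketch using the standard cudaGetDeviceProperties call (device 0 is assumed):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("device:            %s\n", prop.name);
    printf("SM count:          %d\n", prop.multiProcessorCount);
    printf("max threads/block: %d\n", prop.maxThreadsPerBlock);
    printf("max threads/SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("registers/block:   %d\n", prop.regsPerBlock);
    printf("warp size:         %d\n", prop.warpSize);
    return 0;
}
```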

Warp Scheduling and CUDA Cores

CUDA cores are floating-point/ALU units. Each SM also has other types of execution units, including load/store, special-function, and branch units. A CUDA core is equivalent to a SIMD unit in an x86 processor; it is not equivalent to an x86 core.

Occupancy is the ratio of active warps per SM to the maximum number of warps per SM. The more warps per SM, the higher the chance that the warp scheduler has an eligible warp to schedule. However, the higher the occupancy, the fewer resources are available per thread. As a basic goal you want to target more than:

- 25%, or 8 warps, on Tesla
- 50%, or 24 warps, on Fermi
- 50%, or 32 warps, on Kepler (generally higher)

You'll notice there is no real relationship to CUDA cores in these calculations.
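If you want these numbers programmatically rather than from the spreadsheet, a later runtime API, cudaOccupancyMaxActiveBlocksPerMultiprocessor (added in CUDA 6.5, after this answer was written), estimates theoretical occupancy for a given block size. A minimal sketch with a stand-in kernel:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data)   // stand-in kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main()
{
    const int blockSize = 256;
    int maxActiveBlocks = 0;
    // How many blocks of this size can be resident per SM for this kernel.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxActiveBlocks, myKernel, blockSize, 0 /* dynamic smem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int warpsPerBlock = blockSize / prop.warpSize;
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    float occupancy =
        (float)(maxActiveBlocks * warpsPerBlock) / maxWarpsPerSM;

    printf("active blocks/SM: %d, theoretical occupancy: %.0f%%\n",
           maxActiveBlocks, occupancy * 100.0f);
    return 0;
}
```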

To understand this better, read the Fermi whitepaper. If you can use the Nsight Visual Studio Edition CUDA profiler, look at the Issue Efficiency experiment (not yet available in the CUDA Profiler or Visual Profiler) to understand how well your kernel is hiding execution and memory latency.

Upvotes: 1

tropicana

Reputation: 1433

I think there is. A thread block is assigned to a streaming multiprocessor (SM), and the SM further divides the threads of each block into warps of 32 threads that are scheduled to execute (more or less) sequentially. Given this, it will be faster to break each computation into blocks so that they occupy as many SMs as possible. It is also meaningful to build blocks that are multiples of the warp size the card supports (a block of 32 or 64 threads rather than 40 threads, since SMs use 32-thread warps).
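A minimal sketch of that sizing rule (the kernel and helper names are mine, not from the answer): pick a block size that is a multiple of the 32-thread warp, derive the grid size from the problem size by rounding up, and guard the padded tail threads:

```cpp
#include <cuda_runtime.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard threads in the padded last block
        x[i] *= 2.0f;
}

void launchScale(float *d_x, int n)
{
    const int threadsPerBlock = 64;   // multiple of the 32-thread warp
    // Round the grid size up so every element gets a thread.
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_x, n);
}
```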

Upvotes: 5
