Reputation: 21
On my GPU, with Compute Capability 2.0, the maximum number of threads per multiprocessor is 1536. Why is it not a power of 2?
Here are some details for my GPU:
Physical Limits for GPU Compute Capability: 2.0

| Limit | Value |
|---|---|
| Threads per Warp | 32 |
| Max Warps per Multiprocessor | 48 |
| Max Thread Blocks per Multiprocessor | 8 |
| Max Threads per Multiprocessor | 1536 |
| Maximum Thread Block Size | 1024 |
| Registers per Multiprocessor | 32768 |
| Max Registers per Thread Block | 32768 |
| Max Registers per Thread | 63 |
| Shared Memory per Multiprocessor (bytes) | 16384 |
| Max Shared Memory per Block | 16384 |
| Register allocation unit size | 64 |
| Register allocation granularity | warp |
| Shared Memory allocation unit size | 128 |
| Warp allocation granularity | 2 |
Upvotes: 1
Views: 2260
Reputation: 1933
Threads per Warp x Max Warps per Multiprocessor = Max Threads per Multiprocessor
32 x 48 = 1536
Max Warps per Multiprocessor actually means the maximum number of **resident** warps per multiprocessor, and Max Threads per Multiprocessor is the maximum number of **resident** threads per multiprocessor.
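As a quick check of that arithmetic, here is a minimal Python sketch. The constant names are my own; the values are the ones from the question's table for Compute Capability 2.0.

```python
# Values copied from the question's table (Compute Capability 2.0).
THREADS_PER_WARP = 32
MAX_WARPS_PER_SM = 48  # resident warps per multiprocessor

# Resident threads per multiprocessor = warp size x resident warps.
max_threads_per_sm = THREADS_PER_WARP * MAX_WARPS_PER_SM
print(max_threads_per_sm)  # 1536

# A power of two has exactly one bit set, so n & (n - 1) == 0.
# 1536 fails the check because 48 = 3 * 16 contributes a factor of 3.
is_power_of_two = (max_threads_per_sm & (max_threads_per_sm - 1)) == 0
print(is_power_of_two)  # False
```

So the non-power-of-two thread count follows directly from the warp limit of 48.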
Check this out. In Table 14, you will see that the above rule applies to every compute capability.
The number 1536 means that each multiprocessor (called an SM, for Streaming Multiprocessor, in CUDA) can hold at most 1536 resident threads. It doesn't mean that you can launch only 1536 threads: a kernel launch can contain many more, but each SM holds at most 1536 of them at a time. Nor does it mean that 1536 threads are physically executing at the same instant. The unit of execution is the warp, which has been 32 threads in every CUDA generation to date.
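To make the "resident" limit concrete, here is a rough Python sketch of how many threads can be resident on one SM for a few block sizes. The function name `resident_threads_per_sm` is hypothetical, and the model is deliberately simplified: it uses only the warp and block limits from the question's table and ignores register usage, shared memory, and warp allocation granularity, all of which can lower the real figure.

```python
# Limits from the question's table (Compute Capability 2.0).
THREADS_PER_WARP = 32
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 8

def resident_threads_per_sm(block_size):
    """Upper bound on resident threads per SM for a given block size.

    Assumption: only the warp and block limits apply; register and
    shared-memory limits are ignored in this sketch.
    """
    # A block occupies whole warps, so round up to the next warp.
    warps_per_block = -(-block_size // THREADS_PER_WARP)
    # Blocks are limited by both the block cap and the warp cap.
    blocks = min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)
    return blocks * warps_per_block * THREADS_PER_WARP

for block_size in (128, 192, 512, 1024):
    print(block_size, resident_threads_per_sm(block_size))
```

Under these assumptions, blocks of 192 or 512 threads can fill all 1536 slots, while blocks of 128 threads (capped at 8 blocks) or 1024 threads (only one block fits) top out at 1024 resident threads.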
The following quote is from here.
By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (see Features and Specifications of the CUDA C Programming Guide) On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.
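The "more than 24,000" figure in the quote can be reproduced with a one-line calculation. This is only a sketch of the arithmetic; the variable names are mine.

```python
MAX_THREADS_PER_SM = 1536  # resident threads per multiprocessor
NUM_SMS = 16               # multiprocessor count used in the quote

# Total threads that can be resident across the whole GPU at once.
total_resident_threads = MAX_THREADS_PER_SM * NUM_SMS
print(total_resident_threads)  # 24576
```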
EDIT
The additional question is:
Could you also highlight why the Max Warps per Multiprocessor is 48 and not a power of 2 (since the number of cores and register size = 65536 bytes are all powers of two)?
The number of cores per SM is not always a power of two. There is also a subtle difference between a CPU core and a CUDA core. Take devices with compute capability 3.x, for example (link).
A multiprocessor consists of:
- 192 CUDA cores for arithmetic operations,
- 32 special function units for single-precision floating-point transcendental functions,
- 4 warp schedulers.
As you can see, the number of CUDA cores (192) is not a power of 2. And whereas a CPU core is general-purpose, a CUDA core doesn't compute single-precision floating-point transcendental functions; those operations are handled by the separate special function units. Check this out.
Also, your question lists Registers per Multiprocessor as 32K. That means there are 32K 32-bit registers per SM, so the total register file size is 128 KB, not 65536 bytes.
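The register-file size works out as follows. This is a minimal Python sketch with my own variable names; the per-thread figure at the end ignores the register allocation unit size and granularity listed in the question, so treat it as a rough upper bound.

```python
REGISTERS_PER_SM = 32768   # 32-bit registers, from the question's table
BYTES_PER_REGISTER = 4     # 32 bits
MAX_THREADS_PER_SM = 1536

# Total register file size per SM in KB (1 KB = 1024 bytes here).
register_file_kb = REGISTERS_PER_SM * BYTES_PER_REGISTER // 1024
print(register_file_kb)  # 128

# Registers each thread could use if all 1536 threads were resident.
regs_per_thread_full = REGISTERS_PER_SM // MAX_THREADS_PER_SM
print(regs_per_thread_full)  # 21
```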
Given all that, I don't think there's a reason for Max Warps per Multiprocessor to be a power of 2.
Upvotes: 7