user2118922

Reputation: 21

Why is Max Threads per Multiprocessor 1536 on my Compute Capability 2.0 GPU?

On my GPU, with Compute Capability 2.0, the maximum number of threads per multiprocessor is 1536. Why is it not a power of 2?

Here are some details for my GPU:

Physical Limits for GPU Compute Capability: 2.0   
Threads per Warp                            32  
Max Warps per Multiprocessor                48  
Max Thread Blocks per Multiprocessor        8  
Max Threads per Multiprocessor              1536  
Maximum Thread Block Size                   1024  
Registers per Multiprocessor                32768  
Max Registers per Thread Block              32768  
Max Registers per Thread                    63  
Shared Memory per Multiprocessor (bytes)    16384  
Max Shared Memory per Block                 16384  
Register allocation unit size               64  
Register allocation granularity             warp  
Shared Memory allocation unit size          128  
Warp allocation granularity                 2  

Upvotes: 1

Views: 2260

Answers (1)

nglee

Reputation: 1933

Threads per Warp x Max Warps per Multiprocessor = Max Threads per Multiprocessor

32 x 48 = 1536

Max Warps per Multiprocessor actually means the maximum number of **resident** warps per multiprocessor, and Max Threads per Multiprocessor means the maximum number of **resident** threads per multiprocessor.

Check this out. In Table 14, you will see that the above rule applies to every compute capability.
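To make the rule concrete, here is a minimal Python sketch that applies it to a few compute capabilities. Only the 2.0 row comes from the question; the other warp limits are values I recall from the CUDA C Programming Guide, so treat them as assumptions and check them against the table for your toolkit version. The function name `max_resident_threads` is made up for illustration.

```python
# A warp is 32 threads on every CUDA architecture released so far.
THREADS_PER_WARP = 32

# Max resident warps per multiprocessor, keyed by compute capability.
# The "2.0" entry is from the question; the others are recalled from the
# CUDA C Programming Guide and should be verified there.
MAX_RESIDENT_WARPS = {
    "1.0": 24,
    "1.2": 32,
    "2.0": 48,
    "3.0": 64,
}


def max_resident_threads(compute_capability: str) -> int:
    """Threads per Warp x Max Warps per Multiprocessor."""
    return THREADS_PER_WARP * MAX_RESIDENT_WARPS[compute_capability]


if __name__ == "__main__":
    for cc in MAX_RESIDENT_WARPS:
        print(cc, max_resident_threads(cc))
```

The thread limit is just a multiple of 32, so it is a power of 2 only when the warp limit happens to be one.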

The number 1536 means that each multiprocessor (called an SM, for Streaming Multiprocessor, in CUDA) can have at most 1536 active threads. It doesn't mean that you can launch only 1536 threads. You can launch many more than 1536 threads in a single kernel call, but each SM can hold only 1536 of them at a time. Nor does it mean that 1536 threads physically execute at the same time. The unit of execution is the warp, which has been 32 threads in every generation of CUDA so far.
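As a rough illustration of what "resident" means in practice, the sketch below estimates how many blocks of a given size fit on one SM at once, using only the limits from the table in the question (48 warps and 8 blocks per SM). It is a simplification: it ignores register and shared memory usage and the warp allocation granularity, all of which can lower the real number. The helper names are hypothetical.

```python
# Limits for compute capability 2.0, taken from the table in the question.
THREADS_PER_WARP = 32
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 8


def resident_blocks(block_size: int) -> int:
    """Blocks of `block_size` threads that can be resident on one SM,
    considering only the warp and block limits."""
    # A partially filled warp still occupies a whole warp slot.
    warps_per_block = -(-block_size // THREADS_PER_WARP)  # ceiling division
    return min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)


def resident_threads(block_size: int) -> int:
    """Threads resident on one SM for the given block size."""
    return resident_blocks(block_size) * block_size


if __name__ == "__main__":
    for size in (1024, 512, 256, 192, 128):
        print(size, resident_blocks(size), resident_threads(size))
```

Under these assumptions, blocks of 512, 256, or 192 threads reach the full 1536 resident threads, while blocks of 1024 threads leave a third of the warp slots unused because only one such block fits. Any further blocks in the launch wait until a resident block finishes.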

The following quote is from here.

By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (see Features and Specifications of the CUDA C Programming Guide) On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.
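The "more than 24,000" figure in the quote is simply the per-SM limit multiplied by the number of SMs. A tiny sketch, with a made-up function name:

```python
def max_active_threads(num_sms: int, threads_per_sm: int = 1536) -> int:
    """Upper bound on concurrently resident threads across the whole GPU."""
    return num_sms * threads_per_sm


if __name__ == "__main__":
    # 16 multiprocessors, as in the quoted example.
    print(max_active_threads(16))
```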


EDIT

The additional question is:

Could you also highlight why the Max Warps per Multiprocessor is 48 and not a power of 2 (since the number of cores and register size = 65536 bytes are all powers of two)?

The number of cores per SM is not always a power of two. There are also some subtle differences between a CPU core and a CUDA core. Take devices with compute capability 3.x, for example (link).

A multiprocessor consists of:

  • 192 CUDA cores for arithmetic operations,
  • 32 special function units for single-precision floating-point transcendental functions,
  • 4 warp schedulers.

As you can see, the number of CUDA cores (192) is not a power of 2. Also, whereas a CPU core is general-purpose, a CUDA core doesn't perform single-precision floating-point transcendental functions; those operations are handled by separate special function units. Check this out.

Also, your question says Registers per Multiprocessor is 32K. That means there are 32K 32-bit registers per SM, so the total register file size is 128 KB, not 65536 bytes.
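The register arithmetic can be checked with a short sketch. The register count comes from the table in the question, and 4 bytes per register follows from the registers being 32-bit; the function name is hypothetical.

```python
REGISTERS_PER_SM = 32768   # "Registers per Multiprocessor" from the question
BYTES_PER_REGISTER = 4     # each register is 32 bits wide


def register_file_bytes(registers: int = REGISTERS_PER_SM) -> int:
    """Total size of one SM's register file in bytes."""
    return registers * BYTES_PER_REGISTER


if __name__ == "__main__":
    total = register_file_bytes()
    print(total, "bytes =", total // 1024, "KB")
```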

Given all that, I don't think there's a reason for the Max Warps per Multiprocessor to be a power of 2.

Upvotes: 7
