Dipendra Kumar Mishra

Reputation: 311

CUDA performance with respect to threads per block

I am experimenting with CUDA out of interest. In one experiment I had a small kernel that did nothing but run a for loop 10 million times. I launched a single block and increased the number of threads per block from 1 to 1024, then plotted the execution time to see how it varies. The result is a sharp rise at around 350 threads per block, followed by further sharp rises at intervals. The execution time doubles at 1024 threads per block, which suggests that at least one thread has been blocked. The graph looks like a rising staircase. What I want to understand is why these rises occur and what determines the thread counts at which they occur. I am trying to understand this with respect to the number of SMs, CUDA cores, etc.

I am using a GeForce 560 Ti with 8 SMs, 48 cores per SM, and 2 warp schedulers per SM.
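For anyone who wants to reproduce the measurement, here is a minimal sketch of the experiment as described. The original kernel was not posted, so `spin`, its loop body, the `step` granularity, and the `warpsPerBlock` helper are all assumptions made for illustration. The warp count is printed next to each timing only to make it easy to check whether the steps line up with warp boundaries.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical stand-in for the kernel in the question: every thread runs
// the same fixed-length loop and stores its result so the work is observable.
__global__ void spin(int iters, float *out) {
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i)
        acc = acc * 0.5f + 1.0f;
    out[threadIdx.x] = acc;
}

// Number of 32-thread warps needed to hold one block of `threads` threads.
int warpsPerBlock(int threads) { return (threads + 31) / 32; }

int main() {
    const int iters = 10 * 1000 * 1000;  // loop length from the question
    const int maxThreads = 1024;
    const int step = 32;                 // assumed granularity; lower it for a finer plot

    float *out = nullptr;
    CHECK(cudaMalloc(&out, maxThreads * sizeof(float)));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up launch so one-time initialization is not included in the timings.
    spin<<<1, 32>>>(1000, out);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    for (int threads = step; threads <= maxThreads; threads += step) {
        CHECK(cudaEventRecord(start));
        spin<<<1, threads>>>(iters, out);   // a single block, as in the question
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));
        CHECK(cudaGetLastError());

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));
        printf("%4d threads  %2d warps  %9.3f ms\n",
               threads, warpsPerBlock(threads), ms);
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(out));
    return 0;
}
```

Because only one block is launched, all of its threads run on a single SM, so the other SMs stay idle for the whole test.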

Upvotes: 2

Views: 778

Answers (1)

chaohuang

Reputation: 4115

One possible reason for the sharp rise at 350 threads per block is that the block consumes so many resources that an SM cannot process more than one block at a time. You can use the CUDA Occupancy Calculator to see how many blocks one SM can handle at a time, based on the resource usage of your kernel.
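If you would rather query this from code than from the spreadsheet, the runtime function `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (available since CUDA 6.5, as far as I know) reports how many blocks of a given size can be resident on one SM for a particular kernel. The sketch below is only an illustration: `spin` is a hypothetical placeholder for your kernel, and the block sizes tried and the `warpsPerBlock` helper are my own choices.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical placeholder; substitute the kernel you want to inspect.
__global__ void spin(int iters, float *out) {
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i)
        acc = acc * 0.5f + 1.0f;
    out[threadIdx.x] = acc;
}

// Number of warps needed to hold one block of `threads` threads.
int warpsPerBlock(int threads, int warpSize) {
    return (threads + warpSize - 1) / warpSize;
}

int main() {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));   // device 0 is an assumption
    const int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("%s: %d SMs, up to %d resident warps per SM\n",
           prop.name, prop.multiProcessorCount, maxWarpsPerSM);

    for (int threads = 64; threads <= prop.maxThreadsPerBlock; threads += 64) {
        int blocksPerSM = 0;
        // Resident blocks per SM for this kernel at this block size,
        // with no dynamic shared memory.
        CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, spin, threads, 0));

        int residentWarps = blocksPerSM * warpsPerBlock(threads, prop.warpSize);
        printf("%4d threads/block: %2d blocks/SM, %2d of %2d warps resident\n",
               threads, blocksPerSM, residentWarps, maxWarpsPerSM);
    }
    return 0;
}
```

A block size at which the blocks/SM column drops to 1 is where the resource limit described above starts to apply.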

Upvotes: 2
