CUDA performance with respect to threads per block

Question

I am experimenting with CUDA out of interest. In one experiment I had a small kernel that only runs a for loop 10 million times. I launched 1 block and increased the number of threads per block from 1 to 1024, then plotted the execution time to see how it varies. The result is a sharp rise at around 350 threads per block, followed by further sharp rises at intervals. The execution time doubles at 1024 threads per block, which suggests that at least one thread has been blocked. The actual graph looks like an ascending staircase. What I want to understand is why these rises occur and what determines the thread counts at which they occur. I am trying to understand it in terms of the number of SMs, CUDA cores, etc.

I am using a GeForce GTX 560 Ti with 8 SMs, 48 cores per SM, and 2 warp schedulers per SM.
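For reference, the experiment described above could be reproduced with something like the following. This is only a minimal sketch, not the exact code that was measured: the kernel name `spinKernel`, the loop body, the `sink` buffer, and the helper `warpsPerBlock` are all hypothetical, and timing with CUDA events is an assumption about how the execution time was taken.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the "small kernel": every thread runs the same
// fixed-length loop. The arithmetic in the loop body is an assumption.
__global__ void spinKernel(unsigned int iterations, unsigned int *sink)
{
    unsigned int acc = threadIdx.x;
    for (unsigned int i = 0; i < iterations; ++i) {
        acc = acc * 1664525u + 1013904223u;  // cheap, dependent integer work
    }
    // Store the result so the compiler cannot optimize the loop away.
    sink[threadIdx.x] = acc;
}

// Number of 32-thread warps the hardware needs for a block of this size.
static int warpsPerBlock(int threads)
{
    return (threads + 31) / 32;
}

int main()
{
    const unsigned int iterations = 10000000u;  // 10 million, as in the question
    const int maxThreads = 1024;

    unsigned int *sink = nullptr;
    if (cudaMalloc(&sink, maxThreads * sizeof(unsigned int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Short warm-up launch so one-time initialization is not timed.
    spinKernel<<<1, 32>>>(1000u, sink);
    cudaDeviceSynchronize();

    printf("threads,warps,ms\n");
    for (int threads = 1; threads <= maxThreads; ++threads) {
        cudaEventRecord(start);
        spinKernel<<<1, threads>>>(iterations, sink);  // always a single block
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            fprintf(stderr, "launch with %d threads failed: %s\n",
                    threads, cudaGetErrorString(err));
            break;
        }

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%d,%d,%.3f\n", threads, warpsPerBlock(threads), ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(sink);
    return 0;
}
```

Printing the warp count next to each timing makes it easier to check whether the steps in the plot line up with multiples of 32 threads. Because only one block is launched, all of its threads run on a single SM, so the other 7 SMs stay idle in this setup.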

Answers (1)