Reputation: 20915
I wrote a simple CUDA kernel to perform SAXPY over two column vectors of size 2^18.
I found that my GPU, a Tesla C2070, can run a maximum of 1024 threads per block. Hence, I made my block size X = 1024, Y = 1, Z = 1, and my grid size X = 2^18 / 1024, Y = 1, Z = 1. I did this because I wanted to make sure that every thread in every block was being used.
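A minimal sketch of that setup, assuming a standard one-dimensional SAXPY kernel (the names saxpy, d_x, and d_y are illustrative, not necessarily the actual code):

    // y[i] = a * x[i] + y[i], one element per thread
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // n = 2^18 elements, 1024 threads per block, 256 blocks:
    // saxpy<<<(1 << 18) / 1024, 1024>>>(1 << 18, 2.0f, d_x, d_y);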
However, I discovered that running the kernel with block sizes of X = 512 and X = 128 consistently resulted in faster times than running the kernel with a block size of X = 1024.
Why is that? Aren't I wasting threads if my block size is less than 1024?
Upvotes: 3
Views: 785
Reputation: 468
For code that uses shared memory to cache reads, writes, or data shared between threads, a smaller block size can mean a larger shared-memory allocation per thread, which in turn increases the chance of a good memory access pattern (more coalescing).
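For example (assuming the Fermi configuration with 48 KB of shared memory available to a block), a 1024-thread block can use at most 48 bytes of shared memory per thread, while a 128-thread block can use up to 384 bytes per thread.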
I agree with talonmies: in my experience, 128-192 threads per block nearly always gives the best performance for my code, even when it is possible to launch more threads.
Upvotes: 1
Reputation: 72351
Level 1 BLAS functions like SAXPY are memory bandwidth limited. The operation
y <- alpha * x + y
only performs a single FMAD per element, but requires two loads from and one store to global memory. Your C2070 has about 37.5 Gfloat/s of global memory bandwidth and 500 GFMAD/s of single precision arithmetic throughput, so performance is determined by the memory controller rather than the ALUs. Reducing the number of threads per block in memory bandwidth limited kernels often improves performance, because it reduces contention for memory controller and cache resources and increases bandwidth utilisation.
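To put that imbalance in numbers (using the figures above): each element moves three floats (two loads, one store) for a single FMAD, so the memory system caps the kernel at roughly 37.5 / 3 ≈ 12.5 GFMAD/s, only about 2.5% of what the ALUs could sustain.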
This is probably what is happening with your SAXPY kernel. You should be able to find the optimal block size by benchmarking, but my experience is that it will be in the 128-384 threads per block range on a Fermi device like your C2070.
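A rough sketch of such a benchmark, assuming a saxpy kernel like the one in the question and device arrays d_x and d_y already allocated (CUDA events are the usual way to time kernel launches):

    // Sweep candidate block sizes and time each launch with CUDA events.
    int n = 1 << 18;
    int blockSizes[] = { 64, 128, 192, 256, 384, 512, 1024 };

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 7; ++i) {
        int threads = blockSizes[i];
        int blocks  = (n + threads - 1) / threads;   // round up to cover all n

        cudaEventRecord(start);
        saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4d threads/block: %.3f ms\n", threads, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

In practice you would also want a warm-up launch and an average over many repetitions, since a single 2^18-element SAXPY finishes in tens of microseconds.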
Upvotes: 3