Reputation: 10176
I have a CUDA program with a huge number of memory accesses, which are 'random' and thus NOT coalesced at all. When I benchmark this program with different kernel launch parameters, always choosing the number of blocks (blocksize) as a multiple of 7 (from 7 up to, let's say, 980) and threadsPerBlock as a multiple of the warp size (from 32 up to, let's say, 1024), there is NO difference in the runtime of the program. How could one explain that?
Thanks a lot!
Upvotes: 1
Views: 193
Reputation: 6675
The influence of thread block size is minimal. It's the last optimization I would try, and only if occupancy is egregiously bad: Fermi-class hardware has virtually the same performance whenever occupancy is over 50% or so. If your kernel is really bad in some other respect (for example, dominated by uncoalesced memory accesses), then you won't notice any difference at all.
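To check whether occupancy is even changing across the block sizes you tried, you can ask the runtime directly. This is only a rough sketch: `gatherKernel` is a hypothetical stand-in for your kernel (I don't know what yours looks like), and it assumes a toolkit recent enough to provide `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (CUDA 6.5 or later, as far as I know).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: a gather through random indices,
// which produces uncoalesced reads from `in`.
__global__ void gatherKernel(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }

    // Sweep block sizes from one warp up to the device limit.
    for (int threads = prop.warpSize; threads <= prop.maxThreadsPerBlock; threads *= 2) {
        int blocksPerSM = 0;
        // Number of blocks of this size that can be resident on one SM
        // (0 bytes of dynamic shared memory assumed).
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, gatherKernel, threads, 0);

        // Theoretical occupancy = resident threads / max resident threads per SM.
        float occupancy = float(blocksPerSM * threads) / prop.maxThreadsPerMultiProcessor;
        std::printf("threadsPerBlock=%4d  blocks/SM=%2d  occupancy=%.2f\n",
                    threads, blocksPerSM, occupancy);
    }
    return 0;
}
```

If the reported occupancy stays above roughly one half for most of the sizes you benchmarked, a flat runtime curve is what I would expect.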
Also, you can run the CUDA Visual Profiler on your Matlab code. With GPU coding, profile everything.
Follow these steps in the session setup.
That said, from personal experience, see if you can use texture memory to do some caching. Even if the memory accesses are not coalesced, you may still get some cache hits if the accesses have any memory locality.
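As a rough illustration of the texture idea, here is a minimal sketch that reads a linear buffer through a texture object instead of a plain pointer. The kernel name `gatherTex`, the sizes, and the random index pattern are all made up for the example, and texture objects need compute capability 3.0 or later, so on Fermi you would have to use the older texture reference API instead.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical gather kernel: reads in[idx[i]] through the texture cache.
__global__ void gatherTex(cudaTextureObject_t tex, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, idx[i]);
}

int main()
{
    const int n = 1 << 20;  // illustrative size only
    std::vector<float> hIn(n), hOut(n);
    std::vector<int> hIdx(n);
    for (int i = 0; i < n; ++i) {
        hIn[i] = float(i);
        hIdx[i] = std::rand() % n;  // 'random' access pattern
    }

    float *dIn = nullptr, *dOut = nullptr;
    int* dIdx = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, hIdx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Describe the existing linear device buffer as a texture resource.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = dIn;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;  // return raw floats, no normalization

    cudaTextureObject_t tex = 0;
    if (cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr) != cudaSuccess) {
        std::printf("could not create texture object\n");
        return 1;
    }

    const int threads = 256;
    gatherTex<<<(n + threads - 1) / threads, threads>>>(tex, dIdx, dOut, n);
    cudaDeviceSynchronize();
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Verify the gather against the host data.
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (hOut[i] != hIn[hIdx[i]]) { ok = false; break; }
    }
    std::printf(ok ? "gather matches\n" : "gather mismatch\n");

    cudaDestroyTextureObject(tex);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dIdx);
    return ok ? 0 : 1;
}
```

Whether this helps depends entirely on how much locality your index pattern actually has, so time it against the plain-pointer version before committing to it.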
Upvotes: 1