When CUDA kernel launch parameters have no influence on the runtime

Question

I have a CUDA program with a very large number of memory accesses, which are 'random' and thus NOT coalesced at all. When I benchmark this program with different kernel launch parameters, always choosing the blocksize as a multiple of 7 (from 7 up to, say, 980) and threadsPerBlock as a multiple of the warp size (from 32 up to, say, 1024), there is NO difference in the program's runtime. How can that be explained?

Thanks a lot!

peakxu · Accepted Answer

The influence of thread block size is usually minimal. It's the last optimization I would try, and only if occupancy is egregiously bad; Fermi-class GPUs perform virtually the same whenever occupancy is above 50% or so. If your kernel is really bad in some other respect, for example because it is dominated by uncoalesced memory accesses, then you won't notice any difference at all.
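To see where a given block size lands relative to that 50% mark, the runtime can report theoretical occupancy directly. The following is a minimal sketch, not the asker's code: `gatherKernel` is a hypothetical stand-in for a kernel with scattered reads, `occupancyFraction` is an illustrative helper, and it assumes a toolkit recent enough to provide `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (CUDA 6.5 or later).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the asker's kernel: each thread reads one
// element through an index table, so the reads are scattered.
__global__ void gatherKernel(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

// Theoretical occupancy = resident threads per SM / maximum threads per SM.
float occupancyFraction(int activeBlocksPerSM, int blockSize, int maxThreadsPerSM)
{
    return static_cast<float>(activeBlocksPerSM * blockSize) / maxThreadsPerSM;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device available\n");
        return 1;
    }

    // Sweep block sizes in multiples of 32, as in the question.
    for (int blockSize = 32; blockSize <= 1024; blockSize += 32) {
        int activeBlocks = 0;
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &activeBlocks, gatherKernel, blockSize, 0 /* dynamic shared memory */);
        if (err != cudaSuccess) {
            std::printf("block size %4d: %s\n", blockSize, cudaGetErrorString(err));
            continue;
        }
        float occ = occupancyFraction(activeBlocks, blockSize,
                                      prop.maxThreadsPerMultiProcessor);
        std::printf("block size %4d: %d blocks/SM, occupancy %.0f%%\n",
                    blockSize, activeBlocks, 100.0f * occ);
    }
    return 0;
}
```

If most block sizes in the sweep report similar occupancy, a flat runtime curve is what you would expect to see.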

Also, you can run the CUDA Visual Profiler on your MATLAB code. With GPU coding, profile everything.

Follow these steps in the session setup:

In Launch, specify your MATLAB executable.
In Working directory, select the directory containing your MATLAB script.
In Arguments, enter: -nojvm -nosplash -r name_of_matlab_script (without the .m extension)
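Alongside the profiler, a launch-parameter sweep like the one in the question can be timed directly with CUDA events. This is only a sketch under assumptions: `gatherKernel`, `scatteredIndex`, and `timeLaunchMs` are hypothetical names, the index pattern is an arbitrary scattered permutation, and error checks are omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical scattered-read kernel with a grid-stride loop, so any
// combination of grid size and block size covers all n elements.
__global__ void gatherKernel(const float* in, const int* idx, float* out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = in[idx[i]];
}

// Illustrative scattered index: a large odd multiplier, wrapped modulo n.
int scatteredIndex(int i, int n)
{
    return static_cast<int>((static_cast<long long>(i) * 7919) % n);
}

// Time one launch configuration in milliseconds using CUDA events.
float timeLaunchMs(int blocks, int threads,
                   const float* in, const int* idx, float* out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    gatherKernel<<<blocks, threads>>>(in, idx, out, n);  // warm-up launch
    cudaEventRecord(start);
    gatherKernel<<<blocks, threads>>>(in, idx, out, n);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 22;
    std::vector<float> h_in(n, 1.0f);
    std::vector<int> h_idx(n);
    for (int i = 0; i < n; ++i) h_idx[i] = scatteredIndex(i, n);

    float *d_in, *d_out;
    int* d_idx;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Grid sizes in multiples of 7, block sizes in multiples of 32.
    const int gridSizes[] = {7, 70, 490, 980};
    const int blockSizes[] = {32, 128, 512, 1024};
    for (int blocks : gridSizes)
        for (int threads : blockSizes)
            std::printf("blocks=%4d threads=%4d: %.3f ms\n", blocks, threads,
                        timeLaunchMs(blocks, threads, d_in, d_idx, d_out, n));

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_idx);
    return 0;
}
```

Timing the kernel alone in this way separates it from transfer and host overhead, which can otherwise mask any effect of the launch parameters.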

That said, from personal experience, it is worth seeing whether you can use texture memory to do some caching. Even if the memory accesses are not coalesced, you may still get some cache hits from whatever memory locality the access pattern has.
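As a rough illustration of that idea, the sketch below routes scattered reads through a texture object bound to linear memory. It makes assumptions beyond the original answer: it uses the texture object API, which needs a Kepler-or-later GPU (Fermi-era code would have used the older texture references instead), and `gatherTex` and `matchesGather` are hypothetical names. Whether it actually helps depends on how much locality the indices have, so measure before and after.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Scattered gather that reads the input through a texture object, so
// repeated or nearby indices may be served from the texture cache.
__global__ void gatherTex(cudaTextureObject_t tex, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, idx[i]);
}

// Host reference check: out[i] must equal in[idx[i]] for every i.
bool matchesGather(const std::vector<float>& in, const std::vector<int>& idx,
                   const std::vector<float>& out)
{
    for (size_t i = 0; i < idx.size(); ++i)
        if (out[i] != in[idx[i]]) return false;
    return true;
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> h_in(n), h_out(n);
    std::vector<int> h_idx(n);
    for (int i = 0; i < n; ++i) {
        h_in[i] = static_cast<float>(i);
        h_idx[i] = static_cast<int>((static_cast<long long>(i) * 7919) % n);
    }

    float *d_in, *d_out;
    int* d_idx;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, h_idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Describe the linear device buffer as a 1D texture of floats.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_in;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    if (cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr) != cudaSuccess) {
        std::printf("texture objects are not available on this device\n");
        return 1;
    }

    gatherTex<<<(n + 255) / 256, 256>>>(tex, d_idx, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("texture gather %s the host reference\n",
                matchesGather(h_in, h_idx, h_out) ? "matches" : "differs from");

    cudaDestroyTextureObject(tex);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_idx);
    return 0;
}
```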
