Reputation: 3575
I'm using a GeForce GTX 580 (compute capability 2.0).
In my program i'm suspecting that the bottleneck is access to global memory in the kernel. I suspect this because all the calculations involve numbers gotten by indexing an array stored in global memory, and because switching from double precision to single precision only improves the performance by like 10%. (afaik it should be twice as fast with a fermi device if the floating point operations are the bottleneck (?))
So to improve this bottleneck i thought about memory coalescence. The problem here is that i don't know if i achieved it or not. Either i already have it, and this is as good as it gets (25 times faster than the sequential version on an intel i7), or i might get it to run much faster by somehow rewriting to get coalescence.
But is there a way to know? Can i somehow "turn off" coalescence to find out, or find out in another way?
Upvotes: 2
Views: 720
Reputation: 50927
The CUDA Visual profiler will show you the load/store efficiency of each kernel in the summary table; Grizzly gave a good answer about how this has changed in the newer cards here: Compute Prof's fields for incoherent and coherent gst/gld? (CUDA/OpenCL)
Upvotes: 3
Reputation: 2053
No, memory coalescence is not something you turn on or off, it is something you achieve by using the correct memory access patterns and alignment. I am not sure as I have never used (not working on Windows) but I think nVidia's Parallel Nsight can tell you if your memory accesses are coalesced or not.
Upvotes: 2