Reputation: 1107
I read in CUDA by Example, chapter 9.4, that when atomic operations on GPU global memory are used improperly, the program can perform worse than it would running purely on the CPU, because of memory-access contention.
In the worst case, the program on the GPU is completely serialized and no threads execute in parallel, which is essentially how a single-threaded program runs on the CPU. So the key question becomes how fast each processor can access its memory.
Considering the example in the book, it seems that the CPU accesses host memory faster than the GPU accesses global memory on the device.
Is that so? Or are there other factors that influence performance under the circumstances I just described?
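To make the worst case concrete, here is a minimal sketch (kernel and variable names are my own, not taken from the book) of the contended pattern, next to the common mitigation of staging atomics through shared memory so that only one global atomic is issued per block:

```cuda
#include <cstdio>

// Worst case: every thread atomically updates the same global counter,
// so the hardware has to serialize all of the updates.
__global__ void count_contended(const int *data, int n, unsigned int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(counter, 1u);  // all threads contend on one address
}

// Less contention: accumulate a per-block count in shared memory first,
// then issue a single global atomic per block.
__global__ void count_staged(const int *data, int n, unsigned int *counter)
{
    __shared__ unsigned int block_count;
    if (threadIdx.x == 0)
        block_count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(&block_count, 1u);  // contention limited to one block
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(counter, block_count);  // one global atomic per block
}
```

In the first kernel, every update to `*counter` must happen one at a time, so the more threads you launch, the more serialized the program becomes; the second kernel reduces the number of colliding global atomics from one per thread to one per block.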
Upvotes: 3
Views: 820
Reputation: 46862
I think you're misreading things slightly. Yes, it's saying that single-threaded code on the GPU is typically slower than on the CPU, but that's not because of raw memory bandwidth. It's because a CPU is much more powerful than a GPU when running a single thread: a CPU has deep pipelines, hardware prefetching, and sophisticated branch prediction to pull data from memory ahead of time, while a GPU is designed to hide memory latency by switching to another thread while one is waiting for data. The CPU is tuned for the single-threaded case; the GPU is tuned for many threads.
If you want to know which memory is faster, look at the technical specs for your card and motherboard, but that's not really what the book is talking about.
Upvotes: 5