Cuda global memory load and store

Question

So I am trying to hide global memory latency. Take the following code:

for(int i = 0; i < N; i++){
     x = global_memory[i];

     ... do some computation on x ...

     global_memory[i] = x;
}

I wanted to know whether load and store from global memory is blocking, i.e, it doesn't run next line until load or store is finished. For example take the following code:

x_next = global_memory[0];
for(int i = 0; i < N; i++){
     x = x_next;
     x_next = global_memory[i+1];

     ... do some computation on x ...

     global_memory[i] = x;
}

In this code, x_next is not used until next iteration, so does loading x_next overlap with the computation? In other words, which of the following figures will happen?

Cuda global memory load and store

Answers (1)

Related Questions