Dipendra Kumar Mishra

Reputation: 311

CUDA host to device transfer faster than device to host transfer

I was working on a simple CUDA program in which I found that 90% of the runtime came from a single statement: a cudaMemcpy from device to host. The program transfers about 2 MB of data from host to device in 600-700 microseconds, but copies 4 MB of data back from device to host in 10 ms. The total time taken by my program is 13 ms. My question is: why is there an asymmetry between the two copies, device-to-host versus host-to-device? Is it because the CUDA developers assumed that the copy back would usually be smaller in bytes? My second question is: is there any way to circumvent it?

I am using a Fermi GTX560 graphics card with 343 cores and 1GB memory.

Upvotes: 5

Views: 4179

Answers (1)

phoad

Reputation: 1861

Timing CUDA functions is a bit different from timing CPU code. First of all, make sure you do not include the cost of CUDA context initialization in your measurement: call a CUDA function once at the start of your application, otherwise initialization may happen after you have already started your timing.

int main(int argc, char **argv) {
    cudaFree(0);  // first CUDA call forces context initialization
    // ... rest of the program: CUDA is now initialized ...
    return 0;
}

Use a cutil timer (from the CUDA SDK) like this:

unsigned int timer;
cutCreateTimer(&timer);
cutStartTimer(timer);

//your code, to assess elapsed time..

cutStopTimer(timer);
printf("Elapsed: %.3f\n", cutGetTimerValue(timer));
cutDeleteTimer(timer);
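If you would rather not depend on the SDK's cutil library, the same measurement can be done with CUDA events from the runtime API. A minimal sketch (the event API measures GPU-side elapsed time in milliseconds):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ... code to time ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);  // block until the stop event has actually occurred

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Elapsed: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because cudaEventSynchronize() waits for the recorded event, this avoids accidentally stopping the timer while the GPU is still working.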

Now, after these preliminary steps, let's look at the problem. When a kernel is launched, the CPU is stalled only until the launch is delivered to the GPU; the GPU then executes the kernel while the CPU continues. If you call cudaThreadSynchronize(), the CPU stalls until the GPU finishes its current work. A cudaMemcpy from the device also has to wait for the GPU to finish, because the values being copied are the ones the kernel is supposed to fill in.

kernel<<<numBlocks, threadPerBlock>>>(...);

cudaError_t err = cudaThreadSynchronize();
if (cudaSuccess != err) {
  fprintf(stderr, "cudaCheckError() failed at %s:%i : %s.\n", __FILE__, __LINE__, cudaGetErrorString( err ) );
  exit(1);
}

//now the kernel is complete..
cutStopTimer(timer);

So place a synchronization before calling the stop-timer function. If you instead place a memory copy right after the kernel launch and time that copy, its elapsed time will include part of the kernel execution. So place the memcpy operation after the timing operations.
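Putting it together, here is a sketch that times the kernel and the device-to-host copy separately (the timer handles kernelTimer/copyTimer and the buffer names hostBuf, devBuf, nBytes are made up for illustration):

```cuda
// time the kernel on its own
cutStartTimer(kernelTimer);
kernel<<<numBlocks, threadPerBlock>>>(...);
cudaThreadSynchronize();            // wait until the kernel really finishes
cutStopTimer(kernelTimer);

// now time only the device-to-host copy
cutStartTimer(copyTimer);
cudaMemcpy(hostBuf, devBuf, nBytes, cudaMemcpyDeviceToHost);
cutStopTimer(copyTimer);
```

Measured this way, the device-to-host copy no longer absorbs the kernel's execution time, which is the likely explanation for the asymmetry you observed: without the synchronization, the 10 ms attributed to the copy-back actually included the kernel run.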

There are also profiler counters that can be used to assess specific sections of your kernels:

How to profile the number of global memory transactions for cuda kernels?

How Do You Profile & Optimize CUDA Kernels?

Upvotes: 3
