Hmdee

Reputation: 11

Device -> host vs host -> device copy performance in CUDA

I'm new to CUDA and my first task is to implement performance metrics.

I've noticed that copying data from host to device using Thrust vectors takes less time than copying data from device to host. Can anyone explain why?

int dimension = 1000000;
thrust::host_vector<int> host_Table(dimension);

// Host -> device copy
tic2 = get_time();
thrust::device_vector<int> device_Table = host_Table;
toc2 = get_time();

// Device -> host copy
tic3 = get_time();
thrust::host_vector<int> host_TableCopiedFromDevice = device_Table;
toc3 = get_time();

The difference between toc2-tic2 and toc3-tic3 is pretty big.

Thanks

Upvotes: 1

Views: 775

Answers (1)

If_You_Say_So

Reputation: 1283

First off, rather than using CPU timers, keep in mind that it is better to use the CUDA event API for timing measurements. You may also want to consider a warmup call before the timing (see here for more info). I think @Robert Crovella has already answered your question in his comment by pointing out that the vector instantiation is likely the cause of the time difference. But just to prove it, I did a simple test where I measured device-to-host (D2H) and host-to-device (H2D) transfer times for two cases, with and without a vector allocation inside the timed region. Consider this code, which is basically the same as yours:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(){

  int dimension = 1000000; 

  // Some dummy vector to warm up the device
  thrust::device_vector<int> dummy_vec (dimension, 1);

  // Create CUDA events for timing
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  float elapsed = 0; // time in ms

  thrust::host_vector <int> host_Table (dimension);

  // H2D:
  cudaEventRecord(start);
  thrust::device_vector<int> device_Table = host_Table;  
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed, start, stop);
  std::cout<<"H2D elapsed time: " << elapsed << " ms"<< std::endl;

  // D2H:        
  cudaEventRecord(start);
  thrust::host_vector<int> host_TableCopiedFromDevice = device_Table;
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed, start, stop);
  std::cout<<"D2H elapsed time: " << elapsed << " ms"<< std::endl;
}

Running this on a Titan Black (Ubuntu, CUDA 10.1) gives the following time values:

H2D elapsed time: 1.76941 ms
D2H elapsed time: 3.80643 ms

You are right: the D2H time is more than twice the H2D time. Now, the same code but with the vectors allocated before the transfers:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(){

  int dimension = 1000000; 

  // Some dummy vector to warm up the device
  thrust::device_vector<int> dummy_vec (dimension, 1);

  // Create CUDA events for timing
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  float elapsed = 0; // time in ms

  // Vectors allocated before the timed transfers
  thrust::host_vector <int> h_vec (dimension, 1);
  thrust::device_vector <int> d_vec (dimension);
  thrust::host_vector <int> h_vec_2 (dimension);

  // H2D:
  cudaEventRecord(start);
  d_vec = h_vec;
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed, start, stop);
  std::cout<<"H2D elapsed time: " << elapsed << " ms"<< std::endl;

  // D2H:
  cudaEventRecord(start);
  h_vec_2 = d_vec;
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed, start, stop);
  std::cout<<"D2H elapsed time: " << elapsed << " ms"<< std::endl;

}

which gives:

H2D elapsed time: 1.7777 ms
D2H elapsed time: 1.54707 ms

This confirms that the H2D and D2H memory transfers actually take about the same time once other factors are excluded. Another experiment that could have given you a hint is to change dimension to a smaller or larger value and see how the time difference changes.
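
For example, a minimal sketch of such a sweep (just a variation of my second snippet above, untested as written; the exact numbers will depend on your GPU) could look like this:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(){

  // Some dummy vector to warm up the device
  thrust::device_vector<int> dummy_vec (1000000, 1);

  // Create CUDA events for timing
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  float elapsed = 0; // time in ms

  // Sweep over a few sizes, with all vectors allocated before the timed copies
  for (int dimension = 1000; dimension <= 10000000; dimension *= 10){
    thrust::host_vector<int> h_vec (dimension, 1);
    thrust::device_vector<int> d_vec (dimension);
    thrust::host_vector<int> h_vec_2 (dimension);

    // H2D:
    cudaEventRecord(start);
    d_vec = h_vec;
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed, start, stop);
    float h2d = elapsed;

    // D2H:
    cudaEventRecord(start);
    h_vec_2 = d_vec;
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed, start, stop);
    float d2h = elapsed;

    std::cout << "dimension: " << dimension
              << "  H2D: " << h2d << " ms"
              << "  D2H: " << d2h << " ms" << std::endl;
  }

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}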

Upvotes: 1
