Hmdee

Reputation: 11

Device -> host vs host -> device copy performance in CUDA

I'm new to CUDA and my first task is to implement performance metrics.

I've noticed that copying data from host to device using Thrust vectors takes less time than copying data from device to host. Can anyone explain why?

int dimension = 1000000;
thrust::host_vector<int> host_Table(dimension);

// Host -> device copy
tic2 = get_time();
thrust::device_vector<int> device_Table = host_Table;
toc2 = get_time();

// Device -> host copy
tic3 = get_time();
thrust::host_vector<int> host_TableCopiedFromDevice = device_Table;
toc3 = get_time();

The difference between toc2-tic2 and toc3-tic3 is pretty big.

Thanks

Upvotes: 1

Views: 775

Answers (1)

If_You_Say_So

Reputation: 1283

First off, rather than using CPU timers, keep in mind that it is better to use the CUDA event API for timing measurements. You may also want to consider a warmup call before the timing (see here for more info). I think @Robert Crovella has already answered your question in his comment by pointing out that the vector instantiation is likely the cause of the time difference. But just to prove it, I did a simple test where I measured device-to-host (D2H) and host-to-device (H2D) transfer times for two cases, with and without a vector allocation inside the timed region. Consider this code, which is basically the same as yours:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(){

  int dimension = 1000000; 

  // Some dummy vector to warm up the device
  thrust::device_vector<int> dummy_vec (dimension, 1);

  // Create CUDA events for timing
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  float elapsed = 0; // time in ms

  thrust::host_vector <int> host_Table (dimension);

  // H2D:
  cudaEventRecord(start);
  thrust::device_vector<int> device_Table = host_Table;  
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed, start, stop);
  std::cout<<"H2D elapsed time: " << elapsed << " ms"<< std::endl;

  // D2H:        
  cudaEventRecord(start);
  thrust::host_vector<int> host_TableCopiedFromDevice = device_Table;
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed, start, stop);
  std::cout<<"D2H elapsed time: " << elapsed << " ms"<< std::endl;
}

Running this on a Titan Black (Ubuntu, CUDA 10.1) gives the following time values:

H2D elapsed time: 1.76941 ms
D2H elapsed time: 3.80643 ms

You are right: the D2H time is more than twice the H2D time. Now, the same code but with the vectors allocated before the transfers:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(){

  int dimension = 1000000; 

  // Some dummy vector to warm up the device
  thrust::device_vector<int> dummy_vec (dimension, 1);

  // Create CUDA events for timing
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  float elapsed = 0; // time in ms

  // Vectors allocated before the timed transfers
  thrust::host_vector <int> h_vec (dimension, 1);
  thrust::device_vector <int> d_vec (dimension);
  thrust::host_vector <int> h_vec_2 (dimension);

  // H2D:
  cudaEventRecord(start);
  d_vec = h_vec;
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed, start, stop);
  std::cout<<"H2D elapsed time: " << elapsed << " ms"<< std::endl;

  // D2H:
  cudaEventRecord(start);
  h_vec_2 = d_vec;
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed, start, stop);
  std::cout<<"D2H elapsed time: " << elapsed << " ms"<< std::endl;

}

which gives:

H2D elapsed time: 1.7777 ms
D2H elapsed time: 1.54707 ms

This confirms that the H2D and D2H memory transfers actually take about the same time once other factors are excluded. Another experiment that could have given you a hint is to change dimension to a smaller or larger value and see how the time difference changes.
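
For example, a minimal sketch of such a sweep (just a variation of my second snippet above, untested as written; the exact numbers will depend on your GPU) could look like this:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>

int main(){

  // Some dummy vector to warm up the device
  thrust::device_vector<int> dummy_vec (1000000, 1);

  // Create CUDA events for timing
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  float elapsed = 0; // time in ms

  // Sweep over a few sizes, with all vectors allocated before the timed copies
  for (int dimension = 1000; dimension <= 10000000; dimension *= 10){
    thrust::host_vector<int> h_vec (dimension, 1);
    thrust::device_vector<int> d_vec (dimension);
    thrust::host_vector<int> h_vec_2 (dimension);

    // H2D:
    cudaEventRecord(start);
    d_vec = h_vec;
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed, start, stop);
    float h2d = elapsed;

    // D2H:
    cudaEventRecord(start);
    h_vec_2 = d_vec;
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed, start, stop);
    float d2h = elapsed;

    std::cout << "dimension: " << dimension
              << "  H2D: " << h2d << " ms"
              << "  D2H: " << d2h << " ms" << std::endl;
  }

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}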

Upvotes: 1
