Reputation: 35
When I call thrust::inclusive_scan several times, why is the first call much slower than the subsequent ones?
Here is the code:
float ttime;
for (int i = 0; i < 5; i++) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    thrust::device_ptr<int> din(device_input);
    thrust::device_ptr<int> dout(device_output);
    thrust::inclusive_scan(din, din + N, dout);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ttime, start, stop);
    printf("cost %fms\n", ttime);
}
I ran it on a GTX 1080, and the results are:
cost 39.180702ms
cost 0.200704ms
cost 0.201728ms
cost 0.202752ms
cost 0.197632ms
Can anybody help explain this?
Upvotes: 0
Views: 390
Reputation: 132220
Adding a few words to Talonmies' valid answer:
In this question of mine, there are some cocktail-napkin calculations of how much time CUDA initialization takes.
I also suggest that, to separate the loading+init time for libthrust from the runtime API initialization overhead, you perform the following three phases:
Roughly, (T_1 - T_2) is the CUDA load & init time and (T_3 - T_4) is the thrust load & init time.
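The two differences above can be measured along the following lines. This is only a sketch under an assumed reading of the timings: T_1 and T_2 are taken as the first and second launch of an empty kernel (CUDA work with no thrust involved), and T_3 and T_4 as the first and second thrust call. The names empty_kernel and wall_ms and the size N are hypothetical, and the buffers are allocated with plain cudaMalloc so that no thrust code runs before T_3.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

// Empty kernel: needs an initialised CUDA context but uses no thrust code.
__global__ void empty_kernel() {}

// Host wall-clock time of f() in milliseconds. The device is synchronised
// before the clock stops so that asynchronous GPU work is included.
template <typename F>
double wall_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int N = 1 << 20;  // hypothetical problem size

    // T_1 / T_2: first and second CUDA call that does not touch thrust.
    auto launch_empty = [] { empty_kernel<<<1, 1>>>(); };
    double T1 = wall_ms(launch_empty);
    double T2 = wall_ms(launch_empty);

    // Buffers are set up outside the timed regions, without thrust.
    int *device_input = nullptr, *device_output = nullptr;
    cudaMalloc(&device_input, N * sizeof(int));
    cudaMalloc(&device_output, N * sizeof(int));
    cudaMemset(device_input, 0, N * sizeof(int));
    thrust::device_ptr<int> din(device_input), dout(device_output);

    // T_3 / T_4: first and second thrust call.
    auto scan = [=] { thrust::inclusive_scan(din, din + N, dout); };
    double T3 = wall_ms(scan);
    double T4 = wall_ms(scan);

    printf("T_1 - T_2 (rough CUDA load & init):   %f ms\n", T1 - T2);
    printf("T_3 - T_4 (rough thrust load & init): %f ms\n", T3 - T4);

    cudaFree(device_input);
    cudaFree(device_output);
    return 0;
}
```

Host wall-clock time is used here rather than CUDA events because the one-time costs in question are incurred on the host side of the call.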
Upvotes: 1
Reputation: 72352
Thrust is built using the CUDA runtime API, and that API uses lazy context initialisation.
The exact initialisation sequence is not documented, and there is empirical evidence that it has changed over time. However, it appears that context setup is done on an ad hoc basis.
It is likely that the slow first call is related to loading and initialisation of the module containing the thrust code within your program. You might be able to verify this by profiling your code and comparing the execution time the profiler reports for that first call with its wall-clock time.
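If only the steady-state cost of the scan is of interest, a common workaround is to make one untimed call before the measured loop, so that whatever one-time cost the first call carries falls outside the timings. A minimal sketch, assuming the same raw device buffers as in the question; the helper timed_scan_ms and the size N are hypothetical:

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

// Runs one inclusive scan and returns the time between two CUDA events (ms).
float timed_scan_ms(thrust::device_ptr<int> in, thrust::device_ptr<int> out, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    thrust::inclusive_scan(in, in + n, out);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Release the events so repeated calls do not leak them.
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int N = 1 << 20;  // hypothetical problem size
    int *device_input = nullptr, *device_output = nullptr;
    cudaMalloc(&device_input, N * sizeof(int));
    cudaMalloc(&device_output, N * sizeof(int));
    cudaMemset(device_input, 0, N * sizeof(int));
    thrust::device_ptr<int> din(device_input), dout(device_output);

    // Untimed warm-up call: absorbs any first-call overhead.
    thrust::inclusive_scan(din, din + N, dout);
    cudaDeviceSynchronize();

    for (int i = 0; i < 5; i++)
        printf("cost %fms\n", timed_scan_ms(din, dout, N));

    cudaFree(device_input);
    cudaFree(device_output);
    return 0;
}
```

This does not explain where the first-call time goes; it only keeps it out of the reported numbers.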
Upvotes: 2