redips

Reputation: 35

Why is the first call to thrust::inclusive_scan much slower than subsequent calls?

When I call thrust::inclusive_scan several times, why is the first time much slower than subsequent calls?

Here is the code:

float ttime;
for (int i = 0; i < 5; i++) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // device_input, device_output and N are set up earlier (not shown)
    thrust::device_ptr<int> din(device_input);
    thrust::device_ptr<int> dout(device_output);
    thrust::inclusive_scan(din, din + N, dout);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ttime, start, stop);
    printf("cost %fms\n", ttime);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

I ran it on a GTX 1080, and the results are:

cost 39.180702ms
cost 0.200704ms
cost 0.201728ms
cost 0.202752ms
cost 0.197632ms

Can anybody help explain this?

Upvotes: 0

Views: 390

Answers (2)

einpoklum

Reputation: 132220

Adding a few words to Talonmies' valid answer:

  • In this question of mine, there are some cocktail-napkin calculations of how much time the CUDA initialization takes.

  • I also suggest that, to separate the loading+init time for libthrust from the runtime API initialization overhead, you perform the following three phases:

    1. Execute some dummy kernel which doesn't write any output - twice (T_1, T_2)
    2. Make a libthrust call with (almost) no data (but which does launch a kernel) - twice (T_3, T_4)
    3. Now time your real calls

Roughly, (T_1 - T_2) is the CUDA load & init time and (T_3 - T_4) is the thrust load & init time; see the sketch after this list.

  • Looking at the profiling timeline is helpful; CUDA will "shove" most of its initialization into one of your API calls - but not necessarily the first one.
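As a concrete illustration of the three phases above, here is a minimal sketch (my own code, not from the answer): dummy_kernel, the time_ms helper, and the single-element buffer used for the "(almost) no data" thrust call are choices made purely for illustration.

// Sketch of the three-phase measurement (illustrative only, not part of the
// answer): dummy_kernel, time_ms and the one-element buffer are assumptions.
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

__global__ void dummy_kernel() { }   // launches, but writes no output

// Times an arbitrary piece of host code with CUDA events, in milliseconds.
template <typename F>
float time_ms(F f)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    f();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    // Phase 1: dummy kernel, twice - the first launch absorbs the CUDA
    // runtime's lazy context creation.
    float t1 = time_ms([] { dummy_kernel<<<1, 1>>>(); });
    float t2 = time_ms([] { dummy_kernel<<<1, 1>>>(); });

    // Phase 2: a thrust call on (almost) no data, twice - the first call
    // absorbs whatever loading/initialization thrust itself still needs.
    int* d_tiny = nullptr;
    cudaMalloc(&d_tiny, sizeof(int));
    thrust::device_ptr<int> p(d_tiny);
    float t3 = time_ms([&] { thrust::inclusive_scan(p, p + 1, p); });
    float t4 = time_ms([&] { thrust::inclusive_scan(p, p + 1, p); });

    printf("CUDA load & init   ~ %f ms\n", t1 - t2);
    printf("thrust load & init ~ %f ms\n", t3 - t4);

    // Phase 3: time the real, full-size calls here, now that the one-off
    // costs have been paid.
    cudaFree(d_tiny);
    return 0;
}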

Upvotes: 1

talonmies

Reputation: 72352

Thrust is built using the CUDA runtime API, and that API uses lazy context initialisation.

The exact initialisation sequence is not documented, and there is empirical evidence that it has changed over time. However, it appears that context setup is done lazily, on an ad hoc basis.

It is likely that the slow first call is related to loading and initialisation of the module containing the thrust code within your program. You might be able to verify this by profiling your code and looking at the profile execution time versus the wall clock time for that first call.
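In practice, a common way to keep this one-off cost out of your measurements is an untimed warm-up call before the timed loop. A minimal sketch (my own, not from the answer), assuming device_input, device_output and N are allocated and filled as in the question, and wrapping the timing in a hypothetical timed_scans function:

// Sketch only: timed_scans is a hypothetical wrapper; device_input,
// device_output and N are assumed to come from the question's setup code.
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

void timed_scans(int* device_input, int* device_output, int N)
{
    thrust::device_ptr<int> din(device_input);
    thrust::device_ptr<int> dout(device_output);

    // Untimed warm-up: triggers lazy context creation and module loading,
    // so the timed iterations below measure only the scan itself.
    thrust::inclusive_scan(din, din + N, dout);
    cudaDeviceSynchronize();

    for (int i = 0; i < 5; i++) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);

        thrust::inclusive_scan(din, din + N, dout);

        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("cost %fms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
}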

Upvotes: 2
