Reputation: 163
I want to measure the time of a kernel that has to be run more than once; the data to be processed is different for each kernel launch. My code is below. The time of the cudaMemcpy should not be counted.
cudaEvent_t start;
error = cudaEventCreate(&start);
cudaEvent_t stop;
error = cudaEventCreate(&stop);
float msecTotal = 0.0f;
int nIter = 300;
for (int j = 0; j < nIter; j++)
{
    cudaMemcpy(...);
    // Record the start event
    error = cudaEventRecord(start, NULL);
    matrixMulCUDA1<<< grid, threads >>>(...);
    // Record the stop event
    error = cudaEventRecord(stop, NULL);
    error = cudaEventSynchronize(stop);
    float msec = 0.0f;
    error = cudaEventElapsedTime(&msec, start, stop);
    msecTotal += msec;
}
cout << "Total time = " << msecTotal << endl;
To be fair, the comparison algorithm should be timed the same way:
cudaEvent_t start;
error = cudaEventCreate(&start);
cudaEvent_t stop;
error = cudaEventCreate(&stop);
float msecTotal = 0.0f;
int nIter = 300;
for (int j = 0; j < nIter; j++)
{
    // Record the start event
    error = cudaEventRecord(start, NULL);
    matrixMulCUDA2<<< grid, threads >>>(...);
    // Record the stop event
    error = cudaEventRecord(stop, NULL);
    error = cudaEventSynchronize(stop);
    float msec = 0.0f;
    error = cudaEventElapsedTime(&msec, start, stop);
    msecTotal += msec;
}
cout << "Total time = " << msecTotal << endl;
My question is: is this method correct? I am not sure. Obviously, the time measured this way should be longer than normal.
Upvotes: 0
Views: 1883
Reputation: 4422
You should get similar results either way. By recording the events around the kernel launch, you are definitely measuring only the time spent in the kernel and not any time spent on the memcpy.
My only nit is that by calling cudaEventSynchronize() on every iteration of the loop, you are breaking CPU/GPU concurrency that is actually quite important to getting good performance. If you must time each kernel invocation separately (as opposed to timing the whole loop of nIter iterations with a single pair of events around it), you may want to allocate more CUDA events. If you go that route, you do not need 2 events per loop iteration: you bracket the whole operation with two, and record only one additional CUDA event per loop iteration. The time for any given kernel invocation can then be computed by calling cudaEventElapsedTime() on adjacent recorded events.
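For the whole-loop alternative, a minimal sketch (reusing the nIter, grid, threads, and matrixMulCUDA1 names from the question) might look like this; one synchronize at the end keeps the launches asynchronous:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, NULL);
for (int j = 0; j < nIter; j++)
    matrixMulCUDA1<<< grid, threads >>>(...);
cudaEventRecord(stop, NULL);
cudaEventSynchronize(stop);   // wait once, after all work is queued
float msecTotal = 0.0f;
cudaEventElapsedTime(&msecTotal, start, stop);

This gives you the total time for all nIter launches, which you can divide by nIter for an average, but not the time of each individual launch.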
To record the GPU times of N operations (N+1 events):
cudaEvent_t events[N+1];
for (int i = 0; i < N+1; i++)
    cudaEventCreate(&events[i]);
cudaEventRecord(events[0], NULL);   // record first event
for (int j = 0; j < N; j++) {
    // invoke kernel, or do something else you want to time
    cudaEventRecord(events[j+1], NULL);
}
cudaEventSynchronize(events[N]);    // one sync, after all work is queued
// to compute the time taken for operation i, call:
float ms;
cudaEventElapsedTime(&ms, events[i], events[i+1]);
Upvotes: 1