Reputation: 1023
I'm working on a code in which I have to perform a vector-matrix multiplication on a chunk of data, copying the results back to CPU and then start multiplying another chunk. I perform the vector to matrix multiplication using cublas library (following code).
clock_t a,b;
a = clock();
for(int i=0;i<n;i++)
{
cublasSgemv(handle,CUBLAS_OP_T,m,k,&alpha, dev_b1+((i+1)*m), m, dev_b1+(i*m),1, &beta,out,1);
out+=(n-(i+1));
cudaMemcpy(b3,dev_b3, sizeof(float)*(cor_size), cudaMemcpyDeviceToHost);
}
b = clock();
cout<<"Running time is: "<<(double)(b-a)/clocks_per_sec;
I have to measure the running time of this for loop. I read something about CudaEvent but in my case, I want to measure the time of total loop not a kernel so I used clock function. I am wondering is this a correct way to measure the time for this chunk of code or there are more accurate ways to do that? I know that for measuring elapsed time we have to repeat running the code multiple times and take the average of elapsed times of all runs, so another question is that is there any trade-off for the number of times that running code should be repeated?
Thanks
Upvotes: 0
Views: 405
Reputation: 506
cudaMemcpy synchronizes host and device, so a CPU timer such as clock_t should give results that are identical with those produced by a CUDA timer, making the necessary allowances for the granularity/resolution of clock_t.
As regards the accuracy of the measurements is concerned, from what I have seen, the first iteration timings could be disregarded in the calculations. Subsequent timing measurements should yield numbers depending on factors such as load imbalance in the algorithm being run, which might decide on whether we get the same numbers at every iteration. I would reckon that that would not be an issue here, with Sgemm.
Upvotes: 1
Reputation: 769
You can still use CUDA events to measure the entire loop runtime, by recording two events (one before starting the loop, one after the end, i.e. in the positions where you are currently using clock()
), synchronizing on the second event and then getting the elapsed time using cudaEventElapsedTime()
. This should have the advantage of being more accurate than clock()
.
Upvotes: 1