Reputation: 1023
In the following code I'm using the function cublasSetMatrix to copy 3 random matrices of size 200x200 to the device. I measured the time taken by each call:
clock_t t1,t2,t3,t4;
int m =200,n = 200;
float * bold1 = new float [m*n];
float * bold2 = new float [m*n];
float * bold3 = new float [m*n];
// Fill the three host matrices with random values in [0, 9].
for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
    {
        bold1[i*n+j] = rand() % 10;
        bold2[i*n+j] = rand() % 10;
        bold3[i*n+j] = rand() % 10;
    }
float * dev_bold1, * dev_bold2,*dev_bold3;
cudaMalloc ((void**)&dev_bold1,sizeof(float)*m*n);
cudaMalloc ((void**)&dev_bold2,sizeof(float)*m*n);
cudaMalloc ((void**)&dev_bold3,sizeof(float)*m*n);
t1=clock();
cublasSetMatrix(m,n,sizeof(float),bold1,m,dev_bold1,m);
t2 = clock();
cublasSetMatrix(m,n,sizeof(float),bold2,m,dev_bold2,m);
t3 = clock();
cublasSetMatrix(m,n,sizeof(float),bold3,m,dev_bold3,m);
t4 = clock();
cout<<double(t2-t1)/CLOCKS_PER_SEC<<" - "<<double(t3-t2)/CLOCKS_PER_SEC<<" - "<<double(t4-t3)/CLOCKS_PER_SEC;
delete []bold1;
delete []bold2;
delete []bold3;
cudaFree(dev_bold1);
cudaFree(dev_bold2);
cudaFree(dev_bold3);
The output of this code is something like this:
0.121849 - 0.000131 - 0.000141
Every time I run the code, the first cublasSetMatrix call takes far longer than the other two, although all three matrices are the same size and are filled with random numbers.
Can anyone please help me find the reason for this result?
Upvotes: 0
Views: 90
Reputation: 152164
Usually the first CUDA API call in any CUDA program will incur some start-up overhead - the CUDA runtime requires time to initialize everything.
Whenever CUDA libraries are used, there will be some additional one-time start-up overhead associated with initialization of the library. This overhead will often be observed to impact the timing of the first library call.
That seems to be what is happening here. By placing another cuBLAS API call before the first one you are measuring, you have moved the start-up overhead cost to a previous call, and so you don't measure it on the cublasSetMatrix() call anymore.
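To make the effect concrete, here is a minimal sketch in plain C++ that does not touch CUDA at all. `ToyLibrary`, `copy_matrix`, `time_seconds`, and `print_timings` are hypothetical names made up for this illustration, and the 50 ms sleep is an arbitrary stand-in for whatever one-time work a real library might do on its first call. It only models the pattern described above: an untimed warm-up call pays the start-up cost, so the timed calls that follow do not.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Hypothetical stand-in for a library with lazy one-time initialization.
// This is NOT cuBLAS; it only mimics "the first call pays a setup cost".
struct ToyLibrary {
    bool initialized = false;
    int init_count = 0;  // how many times the start-up work has run

    void copy_matrix(const std::vector<float>& src, std::vector<float>& dst) {
        if (!initialized) {
            // Simulated one-time start-up work (arbitrary 50 ms).
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            initialized = true;
            ++init_count;
        }
        dst = src;  // the actual "transfer"
    }
};

// Wall-clock duration of a single call to f, in seconds.
template <typename F>
double time_seconds(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Times two copies after an untimed warm-up call and prints both durations.
void print_timings() {
    const int m = 200, n = 200;
    std::vector<float> host(m * n, 1.0f), dev1, dev2;
    ToyLibrary lib;

    // Untimed warm-up: the one-time start-up cost lands here.
    lib.copy_matrix(host, dev1);
    assert(lib.init_count == 1);

    // Timed calls: neither one triggers the start-up work again.
    double t1 = time_seconds([&] { lib.copy_matrix(host, dev1); });
    double t2 = time_seconds([&] { lib.copy_matrix(host, dev2); });
    std::cout << t1 << " - " << t2 << "\n";
}
```

Calling `print_timings()` from `main` prints two durations of similar size; removing the warm-up line pushes the 50 ms onto the first timed call instead. In the code from the question, the equivalent change would be one extra, untimed cublasSetMatrix call before `t1=clock();`.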
Upvotes: 2