Reputation: 3460
In my toy example I first multiply matrices of size 32x32, 100 000 times, and after that I calculate scalar products of two vectors of size 1024, again 100 000 times. For the first I used cublasSgemm, for the second cublasSdot.
As a result, the time for the first calculation is 530 msec, and for the second 10 000 msec. However, each matrix multiplication requires 32^3 = 32 768 multiply-add operations, while each scalar product requires only 1024 = 32^2. So why am I getting such a result? Here is the code:
#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

__device__ float res;

void randomInit(float *data, int size)
{
    for (int i = 0; i < size; ++i)
        data[i] = rand() / (float)RAND_MAX;
}

int main(){
    cublasHandle_t handle;
    float out;
    cudaError_t cudaerr;
    cudaEvent_t start1, stop1, start2, stop2;
    cublasStatus_t stat;
    int size = 32;
    int num = 100000;
    float *h_A = new float[size*size];
    float *h_B = new float[size*size];
    float *h_C = new float[size*size];
    float *d_A, *d_B, *d_C;
    const float alpha = 1.0f;
    const float beta = 0.0f;
    randomInit(h_A, size*size);
    randomInit(h_B, size*size);
    cudaMalloc((void **)&d_A, size * size * sizeof(float));
    cudaMalloc((void **)&d_B, size * size * sizeof(float));
    cudaMalloc((void **)&d_C, size * size * sizeof(float));
    stat = cublasCreate(&handle);
    cudaEventCreate(&start1);
    cudaEventCreate(&stop1);
    cudaEventCreate(&start2);
    cudaEventCreate(&stop2);
    // warm-up call, not included in either timing
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, size, size, size, &alpha, d_A, size,
                d_B, size, &beta, d_C, size);
    cudaEventRecord(start1, NULL);
    cudaMemcpy(d_A, h_A, size * size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size * size * sizeof(float), cudaMemcpyHostToDevice);
    for (int i = 0; i < num; i++){
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, size, size, size, &alpha, d_A,
                    size, d_B, size, &beta, d_C, size);
    }
    cudaMemcpy(h_C, d_C, size*size*sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop1, NULL);
    cudaEventSynchronize(stop1);
    float msecTotal1 = 0.0f;
    cudaEventElapsedTime(&msecTotal1, start1, stop1);
    std::cout << "total time for MatMul:" << msecTotal1 << "\n";
    cudaEventRecord(start2, NULL);
    cudaMemcpy(d_A, h_A, size * size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size * size * sizeof(float), cudaMemcpyHostToDevice);
    for (int i = 0; i < num; i++){
        cublasSdot(handle, 1024, d_A, 1, d_B, 1, &res);
    }
    cudaEventRecord(stop2, NULL);
    cudaEventSynchronize(stop2);
    float msecTotal2 = 0.0f;
    cudaEventElapsedTime(&msecTotal2, start2, stop2);
    std::cout << "total time for dotVec:" << msecTotal2 << "\n";
    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    delete[] h_A;
    delete[] h_B;
    delete[] h_C;
    return 1;
}
Update: I also tried to perform the dot product with cublasSgemm, by treating the vector as a 1 by 1024 matrix. The result is 3550 msec, which is better, but still 7 times more than the first calculation.
Upvotes: 0
Views: 1096
Reputation: 152164
One problem is that you're not handling the pointer mode correctly for the call to cublasSdot. You'll want to read this section of the manual.
Furthermore, this:

cublasSdot(handle, 1024, d_A , 1, d_B , 1, &res);
                                           ^^^^

is illegal under any circumstances. It is not legal in CUDA to take the address of a device variable in host code. You can certainly do it, but the results are garbage.
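As an aside, if you did want a usable device address for the __device__ variable res, the legal route is the runtime API, e.g. cudaGetSymbolAddress. A minimal sketch (not what I used below):

// Legal way to obtain the device address of a __device__ variable:
// the runtime resolves the symbol; taking &res in host code does not.
float *res_ptr = NULL;
cudaGetSymbolAddress((void **)&res_ptr, res);
// res_ptr is now a valid device pointer that could be passed to cublasSdot
// (after setting CUBLAS_POINTER_MODE_DEVICE)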
When I modify your code as follows:

cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE); // result pointer is a device pointer
float *dres;
cudaMalloc(&dres, sizeof(float)); // device-side storage for the dot-product result
cudaEventRecord(start2, NULL);
cudaMemcpy(d_A, h_A, size * size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size * size * sizeof(float), cudaMemcpyHostToDevice);
for (int i = 0; i < num; i++){
    if (cublasSdot(handle, 1024, d_A, 1, d_B, 1, dres) != CUBLAS_STATUS_SUCCESS) {std::cout << ".";}
}
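(If the host needs the value afterwards, it can be copied back after the loop, e.g.:

// retrieve the final dot-product result from device memory
float hres = 0.0f;
cudaMemcpy(&hres, dres, sizeof(float), cudaMemcpyDeviceToHost);

The cudaMemcpy also synchronizes, so the timing events still bracket all the work.)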
I get about a 2:1 ratio of execution time for cublasSdot to cublasSgemm, which may be plausible, particularly for these sizes.

Under the hood, the dot operation implies a parallel reduction. 1024 threads can compute the partial products, but then a 1024-thread-wide parallel reduction is required. The gemm does not need a parallel reduction, and so may be quicker: 1024 threads can be assigned to produce the 1024 output elements, each in a single thread. For a memory-bound algorithm, the difference between 32^2 and 32^3 operations may not be that significant, but the parallel reduction implies significant additional operations. When I then change size in your program from 32 to 128, I see the ratio reverse, and the matrix multiply does indeed take 3x longer than the dot product.
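To illustrate what the parallel reduction involves, a minimal shared-memory tree reduction for a 1024-element dot product might look like this. This is just the general pattern, not the actual CUBLAS implementation:

// Sketch of a block-wide parallel reduction for a 1024-element dot product.
// Launch as: dot1024<<<1, 1024>>>(d_A, d_B, dres);
__global__ void dot1024(const float *a, const float *b, float *result)
{
    __shared__ float partial[1024];
    int t = threadIdx.x;
    partial[t] = a[t] * b[t];          // 1024 multiplies in parallel
    __syncthreads();
    // tree reduction: log2(1024) = 10 synchronized sweeps over shared memory
    for (int stride = 512; stride > 0; stride >>= 1) {
        if (t < stride)
            partial[t] += partial[t + stride];
        __syncthreads();
    }
    if (t == 0)
        *result = partial[0];          // a single thread writes the sum
}

Those ten synchronized sweeps are the extra work the gemm avoids, since each of its output elements can be produced independently.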
Upvotes: 3