Reputation: 1597
I am trying to compute C = A*A' on the GPU using cuBLAS, and am finding that the rank-k update cublasDsyrk is running about 5x slower than the general matrix-matrix multiplication routine cublasDgemm.
This is surprising to me; I thought syrk would be faster since it is a more specialized piece of code. Is that an unreasonable expectation? Am I doing this wrong?
Ultimately I'm writing CUDA code to be compiled into MEX files for MATLAB, so apologies for not providing a complete working example (there would be a lot of extraneous code for wrangling with the MATLAB objects).
I know this is probably not the best way, but I'm using clock() to time how long the code takes to run:
// Start of main function
clock_t tic = clock();
clock_t toc;

/* ---- snip ---- */

cudaDeviceSynchronize();
toc = clock();
printf("%8ld (%7.3f ms) Allocated memory on GPU for output matrix\n",
       (long)(toc-tic), 1000*(double)(toc-tic)/CLOCKS_PER_SEC);

// Compute the upper triangle of C = alpha*A*A' + beta*C
stat = cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                   M, N, &alpha, A, M, &beta, C, M);
toc = clock();
printf("%8ld (%7.3f ms) cublasDsyrk launched\n",
       (long)(toc-tic), 1000*(double)(toc-tic)/CLOCKS_PER_SEC);

// cuBLAS calls are asynchronous, so synchronize before reading the clock
cudaDeviceSynchronize();
toc = clock();
printf("%8ld (%7.3f ms) cublasDsyrk completed\n",
       (long)(toc-tic), 1000*(double)(toc-tic)/CLOCKS_PER_SEC);
/* ----- snip ----- */
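(For what it's worth, a more robust alternative would be cudaEvent-based timing; a minimal sketch of how that would look, with event names of my own choosing:)

cudaEvent_t start, stop;
float elapsed_ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
stat = cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                   M, N, &alpha, A, M, &beta, C, M);
cudaEventRecord(stop, 0);

// Block until the stop event is recorded, then read the GPU-side elapsed time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("cublasDsyrk took %7.3f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Since the events are recorded on the stream, this measures only the duration of the call itself, independent of CPU-side overhead.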
The output, running on a [12 x 500,000] random matrix (column-major storage):
911 ( 0.911 ms) Loaded inputs, initialized cuBLAS context
1111 ( 1.111 ms) Allocated memory on GPU for output matrix
1352 ( 1.352 ms) cublasDsyrk launched
85269 ( 85.269 ms) cublasDsyrk completed
85374 ( 85.374 ms) Launched fillLowerTriangle kernel
85399 ( 85.399 ms) kernel completed
85721 ( 85.721 ms) Finished and cleaned up
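(For reference, fillLowerTriangle just mirrors the computed upper triangle into the lower half of C; roughly a kernel like this, though the actual code is snipped above:)

// Mirror the upper triangle into the lower triangle (column-major storage):
// one thread per element; sets C(j,i) = C(i,j) for i < j
__global__ void fillLowerTriangle(double *C, int M)
{
    int i = blockIdx.y*blockDim.y + threadIdx.y;   // row
    int j = blockIdx.x*blockDim.x + threadIdx.x;   // column
    if (i < M && j < M && i < j)
        C[j + i*M] = C[i + j*M];
}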
After replacing the syrk call with
stat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, M, M, N,
                   &alpha, A, M, A, M, &beta, C, M);
the whole thing runs way faster:
664 ( 0.664 ms) Loaded inputs, initialized cuBLAS context
796 ( 0.796 ms) Allocated memory on GPU for output matrix
941 ( 0.941 ms) cublasDgemm launched
16787 ( 16.787 ms) cublasDgemm completed
16837 ( 16.837 ms) Launched fillLowerTriangle kernel
16859 ( 16.859 ms) kernel completed
17263 ( 17.263 ms) Finished and cleaned up
I tried it with a few matrices of other sizes; interestingly, it seems that the speed difference is most pronounced when the matrix has few rows. At 100 rows, gemm is only 2x faster, and at 1000 rows it's slightly slower (which is what I would have expected all along).
I'm using CUDA Toolkit 7.5 and the GPU device is an NVIDIA Grid K520 (Kepler, compute capability 3.0). I'm running on an Amazon EC2 g2.2xlarge instance.
Upvotes: 2
Views: 919
Reputation: 9781
[n x 500,000] for n = 12, 100, 1000 are all very wide matrices. In these corner cases, gemm() and syrk() may not be able to reach their peak performance, where syrk() would ideally be nearly twice as fast as gemm() (since the result matrix is symmetric, half of the computation can be saved).
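To put rough numbers on that: for C = A*A' with A of size M x N, gemm() performs about

    2 * M^2 * N flops

while syrk() only needs to compute the M*(M+1)/2 distinct entries of the symmetric result, about

    M * (M+1) * N flops

For M = 12, N = 500,000 that is roughly 144 MFLOPs for gemm() versus 78 MFLOPs for syrk().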
Another consideration is that CUDA gemm()/syrk() implementations usually divide the matrix into fixed-size sub-matrices as the basic computing unit to achieve high performance. The sub-matrix size can be up to 32x64 for dgemm(), as shown in the following paper:
http://www.netlib.org/lapack/lawnspdf/lawn267.pdf
Performance usually drops a lot if your size (12 or 100) is neither much larger than the sub-matrix nor a multiple of it.
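One way to probe this hypothesis (a sketch under my own assumptions; Mpad, Apad and Cpad are hypothetical names, and whether it actually helps is something you would have to measure) is to zero-pad A to a tile-friendly row count such as 32 and run syrk() on the padded matrix; the top-left M x M block of the result is still A*A':

// Hypothetical experiment: pad the M-row A with zero rows up to Mpad = 32.
// Apad is Mpad x N (column-major); Apad*Apad' holds A*A' in its top-left MxM block.
int Mpad = 32;
double *Apad, *Cpad;
cudaMalloc(&Apad, (size_t)Mpad*N*sizeof(double));
cudaMalloc(&Cpad, (size_t)Mpad*Mpad*sizeof(double));
cudaMemset(Apad, 0, (size_t)Mpad*N*sizeof(double));

// Copy each M-row column of A into the first M rows of the matching Mpad-row column
cudaMemcpy2D(Apad, Mpad*sizeof(double),    // dst, dst pitch (bytes per column)
             A,    M*sizeof(double),       // src, src pitch
             M*sizeof(double), N,          // width in bytes, number of columns
             cudaMemcpyDeviceToDevice);

// Assumes beta == 0, so the uninitialized Cpad is never read
stat = cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                   Mpad, N, &alpha, Apad, Mpad, &beta, Cpad, Mpad);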
Upvotes: 2