KQS

Reputation: 1597

cuBLAS dsyrk slower than dgemm

I am trying to compute C = A*A' on the GPU using cuBLAS and am finding that the rank-k update cublasDsyrk is running about 5x slower than the general matrix-matrix multiplication routine cublasDgemm.

This is surprising to me; I thought syrk would be faster since it is a more specialized piece of code. Is that an unreasonable expectation? Am I doing this wrong?

Timing the code

Ultimately I'm writing CUDA code to be compiled into MEX files for MATLAB, so apologies for not providing a complete working example (there would be a lot of extraneous code for wrangling with the MATLAB objects).

I know this is probably not the best way, but I'm using clock() to time how long the code takes to run:

// Start of main function
clock_t tic = clock();
clock_t toc;

/* ---- snip ---- */

cudaDeviceSynchronize();

toc = clock();
printf("%8ld (%7.3f ms) Allocated memory on GPU for output matrix\n",
        (long)(toc-tic),1000*(double)(toc-tic)/CLOCKS_PER_SEC);

// Compute the upper triangle of C = alpha*A*A' + beta*C
stat = cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N, 
        M, N, &alpha, A, M, &beta, C, M);

toc = clock();
printf("%8ld (%7.3f ms) cublasDsyrk launched\n",
        (long)(toc-tic),1000*(double)(toc-tic)/CLOCKS_PER_SEC);

cudaDeviceSynchronize();

toc = clock();
printf("%8ld (%7.3f ms) cublasDsyrk completed\n",
        (long)(toc-tic),1000*(double)(toc-tic)/CLOCKS_PER_SEC);

/* ----- snip ----- */
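For completeness, the same measurement could also be taken with CUDA events instead of clock(), which time the GPU work directly rather than host wall-clock time. A minimal sketch (untested, reusing handle, alpha, beta, A, C, M and N from the snippet above):

// Sketch: timing the same syrk call with CUDA events instead of clock().
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
stat = cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
        M, N, &alpha, A, M, &beta, C, M);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("cublasDsyrk took %7.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);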

Runtimes

The output, running on a [12 x 500,000] random matrix (column-major storage):

  911 (  0.911 ms) Loaded inputs, initialized cuBLAS context
 1111 (  1.111 ms) Allocated memory on GPU for output matrix
 1352 (  1.352 ms) cublasDsyrk launched
85269 ( 85.269 ms) cublasDsyrk completed
85374 ( 85.374 ms) Launched fillLowerTriangle kernel
85399 ( 85.399 ms) kernel completed
85721 ( 85.721 ms) Finished and cleaned up

After replacing the syrk call with

stat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, M, M, N, 
        &alpha, A, M, A, M, &beta, C, M);

the whole thing runs roughly 5x faster:

  664 (  0.664 ms) Loaded inputs, initialized cuBLAS context
  796 (  0.796 ms) Allocated memory on GPU for output matrix
  941 (  0.941 ms) cublasDgemm launched
16787 ( 16.787 ms) cublasDgemm completed
16837 ( 16.837 ms) Launched fillLowerTriangle kernel
16859 ( 16.859 ms) kernel completed
17263 ( 17.263 ms) Finished and cleaned up

I tried it with a few matrices of other sizes; interestingly, the speed difference is most pronounced when the matrix has few rows. At 100 rows, gemm is only about 2x faster, and at 1000 rows gemm is slightly slower than syrk (which is what I would have expected all along).

Other details

I'm using CUDA Toolkit 7.5 and the GPU is an NVIDIA GRID K520 (Kepler, compute capability 3.0). I'm running on an Amazon EC2 g2.2xlarge instance.

Upvotes: 2

Views: 919

Answers (1)

kangshiyin

Reputation: 9781

An [n x 500,000] matrix is very wide for each of n = 12, 100, 1000. In these corner cases, gemm() and syrk() may not be able to reach their peak performance, the regime in which syrk() is nearly twice as fast as gemm() (since the result matrix is symmetric, half of the computation can be skipped).
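To put a number on "half of the computation": for C = A*A' with A of size M x N, gemm() computes all M*M entries of C, while syrk() computes only the M*(M+1)/2 entries on or above the diagonal. A back-of-the-envelope flop count (a rough model of my own, counting one multiply and one add per inner-product term):

#include <stdio.h>

int main(void)
{
    long long M = 12, N = 500000;

    // gemm: all M*M entries, each an N-term inner product (2 flops per term)
    long long gemm_flops = 2LL * M * M * N;

    // syrk: only the M*(M+1)/2 entries on and above the diagonal
    long long syrk_flops = 2LL * (M * (M + 1) / 2) * N;

    printf("gemm: %lld flops, syrk: %lld flops (%.0f%% of gemm)\n",
           gemm_flops, syrk_flops, 100.0 * syrk_flops / gemm_flops);
    return 0;
}

For M = 12 this gives syrk() about 54% of the gemm() flops, approaching 50% as M grows.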

Another consideration is that CUDA gemm()/syrk() implementations usually divide the matrix into fixed-size sub-matrices (tiles) as the basic computing unit to achieve high performance. The sub-matrix can be as large as 32x64 for dgemm(), as shown in the following link.

http://www.netlib.org/lapack/lawnspdf/lawn267.pdf

Performance usually drops a lot if your matrix dimension (12 or 100 here) is neither much larger than the sub-matrix nor a multiple of its size.
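To illustrate how much of each tile a short matrix can leave idle, here is a rough, simplified model of my own (it assumes a hypothetical 32-row tile height; real kernels choose tile shapes differently):

#include <stdio.h>

int main(void)
{
    int tile_rows = 32;                 // assumed tile height
    int sizes[] = { 12, 100, 1000 };

    for (int i = 0; i < 3; i++) {
        int m = sizes[i];
        // number of tile rows needed to cover m matrix rows, rounding up
        int tiles = (m + tile_rows - 1) / tile_rows;
        double util = (double)m / (tiles * tile_rows);
        printf("M = %4d: %2d tile row(s), %5.1f%% of tile rows used\n",
               m, tiles, 100.0 * util);
    }
    return 0;
}

Under this simplified model, M = 12 uses only about 38% of a tile, M = 100 about 78%, and M = 1000 nearly 98%, which lines up with the pattern observed in the question.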

Upvotes: 2
