drjrm3

Reputation: 4718

Is this kind of speedup with CUDA to be expected?

I am comparing BLAS with cuBLAS and I'm getting some mind-blowing results.

The CPU I am using is an Intel(R) Xeon(R) CPU E5-2680 v2 at 2.8 GHz, and I am running my matrix multiplications with cblas_dgemm on increasingly larger matrices.

The GPU I am using is an NVIDIA K40 with 15 multiprocessors, a warp size of 32, and 480 CUDA cores (advertised as 2880 CUDA cores here). The clock speed is 0.71 GHz and I am using cublasDgemm for matrix multiplications.

I have done a runtime analysis and shown that the K40 is ~12.48% faster than the K80 for large matrix operations, which is about what I expected. But I am also showing that the K40 is about 8000% faster than a single-threaded CPU matrix multiply, which is a whole lot faster than I expected, so I suspect something is amiss.

NOTE: I am testing with 100 iterations and averaging the runs, but I am timing only the calls to the respective *gemm functions. I am intentionally leaving out memory allocation time on the CPU and GPU, since I want to test how fast things can go after the CPU-to-GPU data transfer has completed. Given this information, is an 80x speedup plausible?

Upvotes: 1

Views: 2820

Answers (2)

bcumming

Reputation: 1125

I agree that an 80x speedup is plausible if you are comparing DGEMM on a single core of the CPU. I have done a similar benchmark on an E5-2670 v1 @ 2.6 GHz, and I got the following results for double-precision DGEMM with Intel MKL (which should give a good upper bound on performance):

  • single core : 25 GFlops
  • eight cores : 169 GFlops

The source code for my benchmark is on GitHub.

The Ivy Bridge processor that you are testing with, like the Sandy Bridge I use, does not have FMA, and it has nearly the same turbo frequency, so I expect single-core performance to be similar.

I haven't benchmarked DGEMM on a K40, but in my experience you get close to peak performance for DGEMM on NVIDIA Kepler GPUs. Peak on a K40 is 1660 GFlops, which is 66x faster than my single-core results... the same ballpark as your observed 80x.

The larger speedup you see might be because your single-core DGEMM is slower than a very highly tuned implementation like MKL. To get a more representative benchmark you will have to:

  • use a more optimized DGEMM implementation for your host benchmark (try MKL if it is available)
  • use all 10 cores of the CPU

Upvotes: 0

Robert Crovella

Reputation: 151799

An 80x speedup is plausible. I think you could witness something like that in any of the following cases:

  1. dgemm on CPU using cblas in a single thread, and comparing to Intel MKL dgemm
  2. dgemm on CPU using cblas in a single thread, and comparing to cublas dgemm
  3. naive GPU matrix multiply, and comparing to cublas dgemm

In each case, the comparison is between unoptimized code and optimized code.

In the case of an Intel CPU, two key factors for high performance are using multiple threads (to engage most or all of the cores) and using AVX (to engage the vector processing units). It's possible that your cblas dgemm isn't doing this, and so it will run quite slowly. cublas dgemm will use the GPU efficiently, and on an Intel CPU, MKL dgemm will use the CPU efficiently.

Whenever possible, whether programming on the GPU or the CPU, you should use libraries, especially for operations like matrix multiply or FFT, where an efficient implementation is difficult to achieve yourself. Intel MKL, or perhaps OpenBLAS, would be good choices for a BLAS implementation on an Intel CPU.

Upvotes: 1
