drjrm3

Reputation: 4718

Is this kind of speedup with CUDA to be expected?

I am comparing BLAS with cuBLAS and I'm getting some mind-blowing results.

The CPU I am using is an Intel(R) Xeon(R) CPU E5-2680 v2 at 2.8 GHz, and I am running my matrix multiplications with cblas_dgemm on increasingly larger matrices.

The GPU I am using is an NVIDIA K40 with 15 multiprocessors, a warp size of 32, and 480 CUDA cores (advertised as 2880 CUDA cores here). The clock speed is 0.71 GHz and I am using cublasDgemm for matrix multiplications.

I have done a runtime analysis and shown that the K40 is ~12.48% faster than the K80 for large matrix operations, which is about what I expected. But I am also showing that the K40 is about 8000% faster than a single-threaded CPU matrix multiply, which is a whole lot faster than I expected, so I suspect something is amiss.

NOTE: I am testing with 100 iterations and averaging the runs, but I am timing only the calls to the respective *gemm functions. I am intentionally leaving out memory allocation time on the CPU and GPU, since I want to test how fast things can go after the CPU-to-GPU data transfer has completed. Given this information, is an 80x speedup plausible?

Upvotes: 1

Views: 2820

Answers (2)

bcumming

Reputation: 1125

I agree that an 80x speedup is plausible if you are comparing DGEMM on a single core of the CPU. I have done a similar benchmark on an E5-2670 v1 @ 2.6 GHz, and I got the following results for double-precision DGEMM with Intel MKL (which should give a good upper bound on performance):

  • single core : 25 GFlops
  • eight cores : 169 GFlops

The source code for my benchmark is on GitHub.

The Ivy Bridge processor that you are testing with, like the Sandy Bridge I use, does not have FMA, and it has nearly the same turbo frequency, so I expect single-core performance to be similar.

I haven't benchmarked DGEMM on a K40, but in my experience you get close to peak performance for DGEMM on NVIDIA Kepler GPUs. Peak on a K40 is 1660 GFlops, which is 66x faster than my single-core results... the same ballpark as your observed 80x.

The larger speedup you see might be because your single-core DGEMM is slower than a very highly tuned implementation like MKL. To get a more representative benchmark you will have to:

  • use a more optimized DGEMM implementation for your host benchmark (try MKL if it is available)
  • use all 10 cores of the CPU

Upvotes: 0

Robert Crovella

Reputation: 151799

An 80x speedup is plausible. I think you could witness something like that in any of the following cases:

  1. dgemm on CPU using cblas in a single thread, and comparing to Intel MKL dgemm
  2. dgemm on CPU using cblas in a single thread, and comparing to cublas dgemm
  3. naive GPU matrix multiply, and comparing to cublas dgemm

In each case, the comparison is between unoptimized code and optimized code.

In the case of an Intel CPU, two key factors for high performance are using multiple threads (to engage most or all of the cores) and using AVX (to engage the vector processing units). It's possible that your cblas dgemm isn't doing this, and so it will run quite slowly. cublas dgemm will use the GPU efficiently, and on an Intel CPU, MKL dgemm will use the CPU efficiently.

Whenever possible, whether programming on the GPU or the CPU, you should use libraries, especially for operations like matrix multiply or FFT, where an efficient implementation is difficult to achieve yourself. Intel MKL, or perhaps OpenBLAS, would be good choices for a BLAS implementation on an Intel CPU.

Upvotes: 1
