Reputation: 4718
I am comparing BLAS with cuBLAS and I'm getting some mind-blowing results. The CPU I am using is an Intel(R) Xeon(R) CPU E5-2680 v2 at 2.8 GHz, and I am running my matrix multiplications with cblas_dgemm on increasingly larger matrices. The GPU I am using is an NVIDIA K40 with 15 multiprocessors, a warp size of 32, and 480 CUDA cores (advertised as 2880 CUDA cores here). Its clock speed is 0.71 GHz, and I am using cublasDgemm for the matrix multiplications.
I have done a runtime analysis and shown that the K40 is ~12.48% faster than the K80 for large matrix operations, which is about what I expected. But I am also seeing that the K40 is about 8000% faster than a single-threaded CPU matrix dot product, which is a whole lot more than I expected, so I suspect something is amiss.
NOTE: I am testing with 100 iterations and averaging the runs, but I am timing only the calls to the respective *gemm functions. I am intentionally leaving out memory allocation time on the CPU and GPU, since I want to test how fast things can go after the CPU-to-GPU data transfer has completed. Given this information, is an 80x speedup plausible?
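For reference, here is a minimal sketch of the timing scheme I describe (the matrix size, iteration count, and buffer names are illustrative; the cuBLAS handle and device buffers are assumed to be allocated and filled already):

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>

/* Times only the cublasDgemm calls, after the host-to-device copy is done. */
void time_dgemm(cublasHandle_t handle, int N,
                const double *d_A, const double *d_B, double *d_C)
{
    const double alpha = 1.0, beta = 0.0;
    const int iters = 100;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* Warm-up call so one-time setup costs don't pollute the average. */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, d_A, N, d_B, N, &beta, d_C, N);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, d_A, N, d_B, N, &beta, d_C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  /* the calls are asynchronous; wait before reading */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double s = ms / 1e3 / iters;
    printf("N=%d: %.3f ms/call, %.1f GFLOPS\n",
           N, s * 1e3, 2.0 * N * N * N / s / 1e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```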
Upvotes: 1
Views: 2820
Reputation: 1125
I agree that an 80x speedup is plausible if you are comparing DGEMM on a single core of the CPU. I have done a similar benchmark of double-precision DGEMM with Intel MKL (which should give a good upper bound on performance) on an E5-2670 v1 @ 2.6 GHz; the source code for my benchmark is on GitHub.
The Ivy Bridge processor that you are testing with, like the Sandy Bridge I used, does not have FMA, and it has nearly the same turbo frequency, so I expect single-core performance to be similar.
I haven't benchmarked DGEMM on a K40, but in my experience you get close to peak performance for DGEMM on NVIDIA Kepler GPUs. Double-precision peak on a K40 is 1660 GFLOPS, which is 66x faster than my single-core result... the same ballpark as your observed 80x.
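As a rough sanity check on that ratio (assuming a single Sandy Bridge core sustains its AVX peak of 8 double-precision FLOPs per cycle at a turbo clock near 3.1 GHz): 8 × 3.1 ≈ 25 GFLOPS per core, and 1660 / 25 ≈ 66.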
The larger speedup you see might be because you are comparing against a single-core DGEMM that is slower than a very highly tuned implementation like MKL. To get a more representative benchmark you will have to compare against a tuned DGEMM such as MKL's, on a single core and then on all cores.
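Along those lines, here is a minimal sketch of such a CPU-side benchmark against MKL (the matrix size and iteration count are illustrative; dsecnd, mkl_malloc, and mkl_set_num_threads are MKL routines):

```c
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const int n = 4096, iters = 10;
    double *A = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *B = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *C = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    for (size_t i = 0; i < (size_t)n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* 1 for the single-core baseline; raise it to measure the full chip. */
    mkl_set_num_threads(1);

    /* Warm-up call, then time only the dgemm calls, as in the question. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, A, n, B, n, 0.0, C, n);

    double t0 = dsecnd();
    for (int i = 0; i < iters; ++i)
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, A, n, B, n, 0.0, C, n);
    double t = (dsecnd() - t0) / iters;

    printf("n=%d: %.3f s/call, %.1f GFLOPS\n", n, t, 2.0 * n * n * n / t / 1e9);
    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}
```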
Upvotes: 0
Reputation: 151799
80x speedup is plausible. I think you could witness something like that in a number of scenarios; in each one, the comparison is between unoptimized code and optimized code.
In the case of an Intel CPU, two key factors for high performance are using multiple threads (to engage most or all of the cores) and using AVX (to engage the vector processing unit(s)). It's possible that your cblas_dgemm isn't doing either, and so runs quite slowly. cublasDgemm will use the GPU efficiently, and on an Intel CPU, MKL's dgemm will use the CPU efficiently.
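As a quick check (a minimal sketch; mkl_get_max_threads is MKL-specific, so substitute your BLAS's equivalent if you use something else), you can print how many threads your BLAS is prepared to use; if it says 1 on a multi-core machine, you are really comparing one core against the whole GPU:

```c
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    /* MKL honors OMP_NUM_THREADS / MKL_NUM_THREADS; this reports the result. */
    printf("MKL will use up to %d threads\n", mkl_get_max_threads());
    return 0;
}
```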
Whenever possible, whether programming on the GPU or the CPU, you should use libraries, especially for operations like matrix multiply or FFT, where an efficient implementation is difficult to achieve by hand. Intel MKL, or perhaps OpenBLAS, would be a good choice of BLAS implementation for an Intel CPU.
Upvotes: 1