Why is cuFFT "slow" on the K40?

Question

I've compared a simple 3D cuFFT program on both a GTX 780 and a Tesla K40 in double precision mode.

On the GTX 780 I measured about 85 Gflops, while on the K40 I measured about 160 Gflops. These results baffled me: the GTX 780 has 166 Gflops of peak theoretical performance while the K40 has 1.4 Tflops.

That the effective performance of cuFFT on the K40 falls so far short of the theoretical peak is also visible in the graphs published by Nvidia at this link.

Can someone explain why this happens? Is there a limitation in the cuFFT library? Maybe something cache-related?

talonmies · Accepted Answer

The very short answer is that a double precision FFT on a GTX 780 is most likely arithmetic instruction throughput limited, but the same FFT operation is memory bandwidth limited on a Tesla K40.

The slightly longer answer is that a K40 has about 288 GB/s peak memory bandwidth, which is 36 Gwords/s for an 8 byte type like an IEEE 754 float64. The arithmetic throughput of the FFT is limited to the number of FLOP it can execute at that memory throughput. Hitting peak double precision FLOP/s would require something approaching 40 double precision operations per word transferred. Clearly an FFT isn't arithmetically intensive enough, and the result is a much lower peak arithmetic throughput.
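The arithmetic in that paragraph can be reproduced with a minimal sketch. It uses only the approximate K40 figures quoted in this thread (288 GB/s, 1.4 Tflops); the function names are made up for illustration and are not part of any library.

```python
# Approximate Tesla K40 figures as quoted in this thread (not measured here).
PEAK_BANDWIDTH_BYTES_PER_S = 288e9   # about 288 GB/s peak memory bandwidth
PEAK_DP_FLOPS = 1.4e12               # about 1.4 Tflops peak double precision
BYTES_PER_DOUBLE = 8                 # size of an IEEE 754 float64


def words_per_second(bandwidth_bytes_per_s, bytes_per_word):
    """Number of words the memory system can move per second at peak."""
    return bandwidth_bytes_per_s / bytes_per_word


def flops_per_word_at_peak(peak_flops, bandwidth_bytes_per_s, bytes_per_word):
    """Operations needed per word moved to keep the arithmetic units at peak."""
    return peak_flops / words_per_second(bandwidth_bytes_per_s, bytes_per_word)


k40_words = words_per_second(PEAK_BANDWIDTH_BYTES_PER_S, BYTES_PER_DOUBLE)
k40_balance = flops_per_word_at_peak(
    PEAK_DP_FLOPS, PEAK_BANDWIDTH_BYTES_PER_S, BYTES_PER_DOUBLE
)

print(f"K40 memory throughput: {k40_words / 1e9:.0f} Gwords/s")
print(f"Operations per word needed to reach peak: {k40_balance:.1f}")
```

This gives 36 Gwords/s and a little under 40 operations per word, which is where the "approaching 40" figure comes from. Any kernel that does fewer operations per double moved cannot reach the arithmetic peak, no matter how well it is tuned.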

On the GTX 780, which has about the same memory bandwidth as the K40 but about 8 times lower peak double precision throughput, it seems possible to get closer to the arithmetic peak at the available memory bandwidth.
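To put the two cards side by side, here is a rough sketch using the measured and peak figures from the question. It assumes, as this answer does, that both cards have roughly 288 GB/s of memory bandwidth; the helper names are hypothetical.

```python
# Figures from the question (approximate, double precision, in Gflops).
GPUS = {
    "GTX 780": {"measured_gflops": 85.0, "peak_gflops": 166.0},
    "Tesla K40": {"measured_gflops": 160.0, "peak_gflops": 1400.0},
}

# Assumption taken from the answer: both cards have about the same bandwidth.
BANDWIDTH_GB_PER_S = 288.0
BYTES_PER_DOUBLE = 8


def fraction_of_peak(measured_gflops, peak_gflops):
    """Share of the theoretical arithmetic peak that was actually achieved."""
    return measured_gflops / peak_gflops


def flops_per_double_at_peak(peak_gflops, bandwidth_gb_per_s):
    """Operations per double moved that the card needs to stay at peak."""
    gwords_per_s = bandwidth_gb_per_s / BYTES_PER_DOUBLE
    return peak_gflops / gwords_per_s


results = {}
for name, gpu in GPUS.items():
    results[name] = {
        "fraction": fraction_of_peak(gpu["measured_gflops"], gpu["peak_gflops"]),
        "balance": flops_per_double_at_peak(gpu["peak_gflops"], BANDWIDTH_GB_PER_S),
    }
    print(
        f"{name}: {results[name]['fraction']:.0%} of peak, "
        f"needs {results[name]['balance']:.1f} flops per double to saturate"
    )
```

With these numbers the GTX 780 reaches roughly half of its arithmetic peak, while the K40 reaches only around a tenth of its own, even though the K40 is nearly twice as fast in absolute terms. The GTX 780 needs fewer than 5 operations per double moved to saturate its arithmetic units, which an FFT can plausibly come close to; the K40 needs close to 40.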
