Why is cuFFT so slow?

Question

I'm hoping to accelerate a computer vision application that computes many FFTs using FFTW and OpenMP on an Intel CPU. However, for a variety of FFT problem sizes, I've found that cuFFT is slower than FFTW with OpenMP.

In the experiments and discussion below, I find that cuFFT is slower than FFTW for batched 2D FFTs. Why is cuFFT so slow, and is there anything I can do to make cuFFT run faster?

Experiments (code download)

Our computer vision application requires a forward FFT on a bunch of small planes of size 256x256. I'm running the FFTs on on HOG features with a depth of 32, so I use the batch mode to do 32 FFTs per function call. Typically, I do about 8 FFT function calls of size 256x256 with a batch size of 32.

FFTW + OpenMP
The following code executes in 16.0ms on an Intel i7-2600 8-core CPU.

int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};

//if nCols is even, cols_padded = (nCols+2). if nCols is odd, cols_padded = (nCols+1)
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent ot onembed=NULL

float* h_in = (float*)malloc(sizeof(float)*nRows*cols_padded*depth);
memset(h_in, 0, sizeof(float)*nRows*cols_padded*depth);
fftwf_complex* h_freq = reinterpret_cast(h_in); //in-place version

fftwf_plan forwardPlan = fftwf_plan_many_dft_r2c(2, //rank
                                                 n, //dims -- this doesn't include zero-padding
                                                 depth, //howmany
                                                 h_in, //in
                                                 inembed, //inembed
                                                 depth, //istride
                                                 1, //idist
                                                 h_freq, //out
                                                 onembed, //onembed
                                                 depth, //ostride
                                                 1, //odist
                                                 FFTW_PATIENT /*flags*/);
double start = read_timer();
#pragma omp parallel for
for(int i=0; i





cuFFT 

The following code executes in 21.7ms on a top-of-the-line NVIDIA K20 GPU. Note that, even if I use streams, cuFFT does not run multiple FFTs concurrently.

int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};

int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent ot onembed=NULL in FFTW
cufftHandle forwardPlan;
float* d_in; cufftComplex* d_freq;
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
              2, //rank
              n, //dimensions = {nRows, nCols}
              inembed, //inembed
              depth, //istride
              1, //idist
              onembed, //onembed
              depth, //ostride
              1, //odist
              CUFFT_R2C, //cufftType
              depth /*batch*/));

CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
d_freq = reinterpret_cast(d_in);

double start = read_timer();
for(int i=0; i




Other notes


In the GPU version, cudaMemcpys between the CPU and GPU are not included in my computation time.
The performance numbers presented here are averages of several experiments, where each experiment has 8 FFT function calls (total of 10 experiments, so 80 FFT function calls).
I've tried many problem sizes (e.g. 128x128, 256x256, 512x512, 1024x1024), all with depth=32. Based on the nvvp profiler, some sizes like 1024x1024 are able to fully saturate the GPU. But, for all of these sizes, the CPU FFTW+OpenMP is faster than cuFFT.

Why is cuFFT so slow?

Experiments (code download)

Answers (1)

Related Questions