FFTW3 - Parallelising 1D in place complex fft is slow

Question

I am working on parallelising a 1D FFT. As a first step, I benchmarked the FFTW3 library on an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz machine, which has 16 cores. I ran a basic 1D complex FFT with OpenMP as the threading library. I compiled with ICC using the following command:

icc -Wall -Werror
-I/.../mkl/include -I/apps/intel/linux/mkl/include/fftw  
fftw3_dft.c   
-L/.../intel/linux/mkl/.../intel64 -lmkl_rt 
-L/.../intel/.../linux/mkl/../compiler/lib/intel64 
-L/apps/intel/.../clinux/mkl/../tbb/lib/intel64/gcc4.4 
-liomp5 -lm -lpthread -ldl 
 -o fftw3_dft.out

I calculated the speedup for different problem sizes, but I am not able to explain the resulting plot:

For problem sizes between 2^21 and 2^24, why is there no speedup when using 2 threads, even though there is some speedup with 4, 8 and 16 threads?
Why is there a sudden increase in speedup when the problem size becomes greater than 2^27?
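
The speedup figures discussed above are not defined in the post, so the following is only a minimal sketch assuming the usual definition: the single-thread time divided by the p-thread time, with efficiency being that ratio divided by p. The function names `speedup` and `efficiency` are hypothetical helpers, not part of FFTW or MKL.

```c
#include <assert.h>

/* Assumed definition: S(p) = T(1) / T(p), with both times in the same unit. */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Parallel efficiency E(p) = S(p) / p; 1.0 would mean perfect scaling. */
double efficiency(double t1, double tp, int p)
{
    assert(p > 0);
    return speedup(t1, tp) / p;
}
```

With this definition, "no speedup on 2 threads" corresponds to `speedup(t1, t2)` staying close to 1.0, that is, an efficiency of about 0.5.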

Code

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#include "fftw3.h"
#include "mkl.h"

/* Compute (K*L)%M accurately */
static double moda(int K, int L, int M)
{
    return (double)(((long long)K * L) % M);
}

/* Initialize array x[N] with harmonic H */
static void init(fftw_complex *x, int N, int H)
{
    double TWOPI = 6.2831853071795864769, phase;
    int n;

    for (n = 0; n < N; n++)
    {
        phase = moda(n, H, N) / N;
        x[n][0] = cos(TWOPI * phase) / N;
        x[n][1] = sin(TWOPI * phase) / N;
    }
}

int main(int argc, char *argv[]) {

    if (argc < 3) {
        printf("Error : give args\n");
        return 1;
    }


    int N = atoi(argv[1]);
    int p = atoi(argv[2]);

    int H = -N/2;
    fftw_plan forward_plan = 0, backward_plan = 0;
    fftw_complex *x = 0;
    int status = 0;

    fftw_init_threads();
    fftw_plan_with_nthreads(p);

    x = fftw_malloc(sizeof(fftw_complex)*N);

    forward_plan = fftw_plan_dft(1, &N, x, x, FFTW_FORWARD, FFTW_ESTIMATE);

    init(x, N, H);

    double start_time = dsecnd();

    /*--------------ALG STARTS HERE --------------------------*/
    fftw_execute(forward_plan);
    /*--------------ALG ENDS HERE --------------------------*/

    double end_time = dsecnd();

    printf("%d, %d, %lf\n", N, p, end_time - start_time);

    /* Destroy the plan and free the buffer before cleaning up the threads. */
    fftw_destroy_plan(forward_plan);
    fftw_free(x);
    fftw_cleanup_threads();

}
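
The program above times a single `fftw_execute` call, so the measurement may include one-time costs such as thread start-up. As a hedged sketch of a more repeatable measurement, the helper below runs a workload once untimed and then reports the fastest of several timed repetitions. The names `wall_seconds`, `best_time` and `count_call` are hypothetical; `count_call` is only a stand-in for a wrapper around `fftw_execute(forward_plan)`, and C11 `timespec_get` is used here in place of MKL's `dsecnd`.

```c
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get. */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Hypothetical helper: call fn(arg) once as a warm-up, then reps more
 * times under the timer, and return the fastest timed run in seconds.
 * Returns -1.0 without calling fn if reps <= 0. */
double best_time(void (*fn)(void *), void *arg, int reps)
{
    double best = 0.0;
    int i;

    if (reps <= 0)
        return -1.0;

    fn(arg); /* warm-up run, not timed */

    for (i = 0; i < reps; i++) {
        double t0 = wall_seconds();
        fn(arg);
        double dt = wall_seconds() - t0;
        if (i == 0 || dt < best)
            best = dt;
    }
    return best;
}

/* Stand-in workload: counts how many times it was called. */
void count_call(void *arg)
{
    ++*(int *)arg;
}
```

In the benchmark, the callback would wrap `fftw_execute(forward_plan)`. Because the transform is in place, each repetition overwrites its own input, so the array may need to be re-initialised outside the timed region if its values matter.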
