Tips on improving FFTW performance for my Fortran solver?

Question

I use a pseudospectral DNS code for fluid simulations (a code I inherited) and I'm trying to boost performance by replacing the old FFT routines with equivalent FFTW routines. I have done this successfully in that I am getting the correct answers in my test cases, but I feel like I'm doing some things inefficiently, and I would appreciate tips on specific things that might improve my code. I will show a sample code snippet of the FFT routines and then ask specific questions based on what I've observed so far.

Sample Code:

! Complex array u of size (nyp,nz,nx/2) is stored in us and normalized before transform
! mx = (3/2)nx, mz = (3/2)nz for de-aliasing

...

complex(C_DOUBLE_COMPLEX),dimension(nyp,mz,mx) :: us,aspec
real(C_DOUBLE),dimension(nyp,mz,mx) :: up,aphys

...

! Plan FFTW transforms with dummy variables
planZb = fftw_plan_dft_1d(mz,aspec,aspec,FFTW_BACKWARD,FFTW_PATIENT)
planXb = fftw_plan_dft_c2r_1d(mx,aspec,aphys,FFTW_PATIENT)
planY  = fftw_plan_r2r_1d(nyp,aphys,aphys,FFTW_REDFT00,FFTW_PATIENT)

...

! Complex --> Complex z-transform
do k = 1,nxh
    do i = 1,nyp
        call fftw_execute_dft(planZb,us(i,:,k),us(i,:,k))
        .
        .
        .
    end do
end do

! Complex --> Real x-transform
do j = 1,mz
    do i = 1,nyp
        call fftw_execute_dft_c2r(planXb,us(i,j,:),up(i,j,:))
        .
        .
        .
    end do
end do

! Real --> Real y-transform (DCT-I)
do k = 1,mx
    do j = 1,mz
        call fftw_execute_r2r(planY,up(:,j,k),up(:,j,k))
        .
        .
        .
    end do
end do

! Do stuff here

! Inverse transforms here, reverse process above + normalizations

Notes:

I use OpenMP threading and a few compiler optimizations, not shown here. These speed things up quite a bit, but I want to focus on how I'm using FFTW and arranging my data.
In the full version of the code, I do each transform on 58 different variables of identical size to u/us/up. The FFTW documentation recommends making a plan for each variable you transform, since subsequent plans are cheap to compute, but I'm not sure how useful that is for such a large number of variables.
I have tried using fftw_plan_many_dft... for the transforms instead of the loops shown above. However, as I implemented it, this required shuffling the data so that the transform direction index (x, y, or z) comes first, and simple tests I've done show this approach to be much slower, especially as the grid size increases.
I do sequential 1D transforms in the x- and z-directions, but I could also do these as a single 2D transform. However, in a test code, the 2D transform was comparable in compute time, even slightly slower.
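
For reference, fftw_plan_many_dft takes explicit stride and distance arguments, so in principle a batch of z-transforms can be planned on the (nyp,mz,mx) layout as stored, without moving the transform index first. The following is only a minimal sketch of that idea, not the solver's code: the grid sizes are made up, FFTW_ESTIMATE is used so planning does not overwrite data, and the names plan_line, plan_slab, ref and line are hypothetical. It transforms one contiguous (nyp,mz) slab per call and checks the result against a line-by-line loop.

```fortran
program batched_z_transform
    use, intrinsic :: iso_c_binding
    implicit none
    include 'fftw3.f03'

    ! Illustrative sizes only (assumed; far smaller than a real DNS grid)
    integer(C_INT), parameter :: nyp = 5, mz = 12, mx = 6
    complex(C_DOUBLE_COMPLEX), allocatable :: us(:,:,:), ref(:,:,:), line(:)
    real(C_DOUBLE), allocatable :: re(:,:,:), im(:,:,:)
    type(C_PTR) :: plan_line, plan_slab
    integer(C_INT) :: n(1)
    integer :: i, k

    allocate(us(nyp,mz,mx), ref(nyp,mz,mx), line(mz))
    allocate(re(nyp,mz,mx), im(nyp,mz,mx))

    ! Reference plan: a single contiguous line of length mz
    plan_line = fftw_plan_dft_1d(mz, line, line, FFTW_BACKWARD, FFTW_ESTIMATE)

    ! Batched plan for one (nyp,mz) slab: nyp transforms of length mz.
    ! Consecutive z-points are nyp elements apart (stride = nyp) and
    ! consecutive transforms start one element apart (dist = 1).
    n(1) = mz
    plan_slab = fftw_plan_many_dft(1, n, nyp, us, n, nyp, 1, &
                                   us, n, nyp, 1, FFTW_BACKWARD, FFTW_ESTIMATE)

    call random_number(re)
    call random_number(im)
    us  = cmplx(re, im, kind=C_DOUBLE_COMPLEX)
    ref = us

    ! Reference result: copy each line into a contiguous buffer and transform it
    do k = 1, mx
        do i = 1, nyp
            line = ref(i,:,k)
            call fftw_execute_dft(plan_line, line, line)
            ref(i,:,k) = line
        end do
    end do

    ! Batched result: one call per slab; us(:,:,k) is contiguous in memory
    do k = 1, mx
        call fftw_execute_dft(plan_slab, us(:,:,k), us(:,:,k))
    end do

    print *, 'max difference between batched and looped results:', &
             maxval(abs(us - ref))
    if (maxval(abs(us - ref)) > 1.0e-10_C_DOUBLE) error stop 'results differ'

    call fftw_destroy_plan(plan_line)
    call fftw_destroy_plan(plan_slab)
end program batched_z_transform
```

This needs to be compiled against an installed FFTW, for example with something like gfortran -I<fftw include dir> file.f90 -lfftw3 (the include path is system-dependent). Two caveats: a section such as us(i,:,k) is not contiguous, so compilers typically pass it through a temporary copy, which the slab form avoids; and for real arrays a slab may not share the alignment of the array the plan was created with, in which case adding FFTW_UNALIGNED to the planner flags is the conservative choice.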

For the sake of having a single, consolidated question:

Is it worth using FFTW's advanced/guru interface to replace these FFT loops? It seems this requires data shuffling, which is quite expensive, but I'm not sure whether FFTW has a good way of handling that.
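
To make the guru option concrete, here is a hedged sketch of what a single guru plan for all z-transforms might look like, again assuming the column-major (nyp,mz,mx) layout, made-up sizes, and FFTW_ESTIMATE planning; the variable names are hypothetical. The transform dimension and the two loop dimensions are each described by an fftw_iodim (length, input stride, output stride), so no data is rearranged. Whether this is actually faster than the loops would have to be measured.

```fortran
program guru_z_transform
    use, intrinsic :: iso_c_binding
    implicit none
    include 'fftw3.f03'

    ! Illustrative sizes only (assumed)
    integer(C_INT), parameter :: nyp = 5, mz = 12, mx = 6
    complex(C_DOUBLE_COMPLEX), allocatable :: us(:,:,:), ref(:,:,:), line(:)
    real(C_DOUBLE), allocatable :: re(:,:,:), im(:,:,:)
    type(fftw_iodim) :: dims(1), howmany_dims(2)
    type(C_PTR) :: plan_line, plan_all
    integer :: i, k

    allocate(us(nyp,mz,mx), ref(nyp,mz,mx), line(mz))
    allocate(re(nyp,mz,mx), im(nyp,mz,mx))

    ! Transform dimension: mz points, nyp elements apart
    dims(1)%n = mz
    dims(1)%is = nyp
    dims(1)%os = nyp
    ! Loop over the first index: nyp transforms, 1 element apart
    howmany_dims(1)%n = nyp
    howmany_dims(1)%is = 1
    howmany_dims(1)%os = 1
    ! Loop over the third index: mx slabs, nyp*mz elements apart
    ! (use a smaller count here to transform only part of the x-range)
    howmany_dims(2)%n = mx
    howmany_dims(2)%is = nyp*mz
    howmany_dims(2)%os = nyp*mz

    plan_all  = fftw_plan_guru_dft(1, dims, 2, howmany_dims, us, us, &
                                   FFTW_BACKWARD, FFTW_ESTIMATE)
    plan_line = fftw_plan_dft_1d(mz, line, line, FFTW_BACKWARD, FFTW_ESTIMATE)

    call random_number(re)
    call random_number(im)
    us  = cmplx(re, im, kind=C_DOUBLE_COMPLEX)
    ref = us

    ! Reference result: one line at a time through a contiguous buffer
    do k = 1, mx
        do i = 1, nyp
            line = ref(i,:,k)
            call fftw_execute_dft(plan_line, line, line)
            ref(i,:,k) = line
        end do
    end do

    ! Guru result: every z-line of the array in a single call
    call fftw_execute_dft(plan_all, us, us)

    print *, 'max difference between guru and looped results:', &
             maxval(abs(us - ref))
    if (maxval(abs(us - ref)) > 1.0e-10_C_DOUBLE) error stop 'results differ'

    call fftw_destroy_plan(plan_line)
    call fftw_destroy_plan(plan_all)
end program guru_z_transform
```

Because the plan only encodes sizes and strides, the same plan could in principle be executed on other arrays of identical shape and alignment through fftw_execute_dft, which may matter when there are many variables of the same size.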

Thanks!
