Reputation: 61
I am doing huge calculations (derivatives here, but they look similar to image operations) on arrays that do not fit in cache, meaning the CPU has to load a part into the cache, compute, then load another part, and so on. Because of the shape of the calculations, some data is loaded, unloaded, and reloaded multiple times. I was wondering if there is a way to optimize this. I am already using SIMD instructions through compiler optimization (GCC and Intel).
This is Fortran code, but it is similar to C/C++: the memory order is just inverted, arrays use () instead of [], and for is replaced by do.
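To make that inverted memory order concrete for C/C++ readers, here is a minimal self-contained illustration (not part of my production code): Fortran is column-major, so the first index varies fastest in memory and inner loops should run over it.
program layout
  implicit none
  integer, parameter :: n = 4, m = 3
  real :: A(n,m)
  integer :: i, j
  do j = 1, m
    do i = 1, n          ! first index i is the contiguous one
      A(i,j) = real(i + n*(j-1))
    end do
  end do
  print *, A             ! prints 1..12: memory order runs column by column
end program layout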
On the x axis:
do k = 1, N(3)
  do j = 1, N(2)
    do i = 3, N(1)-2   ! stencil uses i-2..i+2, so stay 2 points from the edges
      DF(i,j,k) = (F(i+1,j,k)-F(i-1,j,k))*B + (F(i-2,j,k)-F(i+2,j,k))*C
    end do
  end do
end do
On the y axis:
do k = 1, N(3)
  do j = 3, N(2)-2   ! stencil uses j-2..j+2
    do i = 1, N(1)
      DF(i,j,k) = (F(i,j+1,k)-F(i,j-1,k))*B + (F(i,j-2,k)-F(i,j+2,k))*C
    end do
  end do
end do
On the z axis:
do k = 3, N(3)-2   ! stencil uses k-2..k+2
  do j = 1, N(2)
    do i = 1, N(1)
      DF(i,j,k) = (F(i,j,k+1)-F(i,j,k-1))*B + (F(i,j,k-2)-F(i,j,k+2))*C
    end do
  end do
end do
The first derivative along the x axis is OK because memory is read contiguously. For the y and z derivatives, the stencil points are far apart in memory, so the accesses are not contiguous.
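(Since Fortran stores arrays column-major, F(i,j+1,k) lies N(1) elements away from F(i,j,k) in memory, and F(i,j,k+1) lies N(1)*N(2) elements away; the y stencil therefore needs five j-columns, and the z stencil five k-planes, resident in cache at once.)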
And the worst calculation I have combines all axes (this is a Laplacian operator):
do k = 2, N(3)-1
  do j = 2, N(2)-1
    do i = 2, N(1)-1   ! stencil uses i-1..i+1 (and likewise for j and k)
      V(i,j,k) = M(i,j,k,1) * p(i,j,k) &
            & + M(i,j,k,2) * p(i-1,j,k) &
            & + M(i,j,k,3) * p(i+1,j,k) &
            & + M(i,j,k,4) * p(i,j-1,k) &
            & + M(i,j,k,5) * p(i,j+1,k) &
            & + M(i,j,k,6) * p(i,j,k-1) &
            & + M(i,j,k,7) * p(i,j,k+1)
    end do
  end do
end do
Note that compilers do not vectorize the last operation (the Laplacian) as written. To get SIMD (vectorized) code, I have to split the operation like this, which gives a 2.5x speedup:
do k = 2, N(3)-1
  do j = 2, N(2)-1
    do i = 2, N(1)-1
      V(i,j,k) = M(i,j,k,1) * p(i,j,k) &
            & + M(i,j,k,2) * p(i-1,j,k) &
            & + M(i,j,k,3) * p(i+1,j,k)
    end do
  end do
end do
do k = 2, N(3)-1
  do j = 2, N(2)-1
    do i = 2, N(1)-1
      V(i,j,k) = V(i,j,k) &
            & + M(i,j,k,4) * p(i,j-1,k) &
            & + M(i,j,k,5) * p(i,j+1,k)
    end do
  end do
end do
do k = 2, N(3)-1
  do j = 2, N(2)-1
    do i = 2, N(1)-1
      V(i,j,k) = V(i,j,k) &
            & + M(i,j,k,6) * p(i,j,k-1) &
            & + M(i,j,k,7) * p(i,j,k+1)
    end do
  end do
end do
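One alternative I could also try (a sketch, not something benchmarked above) is to keep the Laplacian in a single loop and mark the inner loop with the OpenMP SIMD directive, which explicitly asks the compiler to vectorize it. Compile with gfortran -fopenmp-simd or ifort -qopenmp-simd; it assumes V, M, and p do not alias:
do k = 2, N(3)-1
  do j = 2, N(2)-1
    !$omp simd
    do i = 2, N(1)-1   ! request vectorization of the unsplit stencil
      V(i,j,k) = M(i,j,k,1) * p(i,j,k) &
            & + M(i,j,k,2) * p(i-1,j,k) &
            & + M(i,j,k,3) * p(i+1,j,k) &
            & + M(i,j,k,4) * p(i,j-1,k) &
            & + M(i,j,k,5) * p(i,j+1,k) &
            & + M(i,j,k,6) * p(i,j,k-1) &
            & + M(i,j,k,7) * p(i,j,k+1)
    end do
  end do
end do
If it vectorizes, the single loop also writes V once instead of three times, saving memory traffic.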
Maybe with SIMD I have already reached maximum speed, but since these calculations take days, even with MPI and more than 1024 CPUs, reducing the computation time by even 20% would be a great step! Does anyone have ideas on how to optimize this?
Upvotes: 3
Views: 142
Reputation: 60018
When you use 3D stencils and you reference elements like i,j,k-1 and i,j,k+1, the linear order in which you go through the array will not be optimal. The cache efficiency can be increased by loop tiling.
In my code I use
!$omp parallel private(i,j,k,bi,bj,bk)
!$omp do schedule(runtime) collapse(3)
do bk = 1, Unz, tnz
  do bj = 1, Uny, tny
    do bi = 1, Unx, tnx
      do k = bk, min(bk+tnz-1,Unz)
        do j = bj, min(bj+tny-1,Uny)
          do i = bi, min(bi+tnx-1,Unx)
            U2(i,j,k) = U2(i,j,k) + &
                        (U(i+1,j,k)-U(i,j,k)) * ...
            U2(i,j,k) = U2(i,j,k) - &
                        (U(i,j,k)-U(i-1,j,k)) * ...
            U2(i,j,k) = U2(i,j,k) + &
                        (U(i,j+1,k)-U(i,j,k)) * ...
            U2(i,j,k) = U2(i,j,k) - &
                        (U(i,j,k)-U(i,j-1,k)) * ...
            U2(i,j,k) = U2(i,j,k) + &
                        (U(i,j,k+1)-U(i,j,k)) * ...
            U2(i,j,k) = U2(i,j,k) - &
                        (U(i,j,k)-U(i,j,k-1)) * ...
          end do
        end do
      end do
    end do
  end do
end do
!$omp end do
!$omp end parallel
where tnx, tny, tnz are the sizes of a tile within which you iterate in the i,j,k order. The tile size should be chosen so that the tile's working set is close to the L1 cache size. This will increase reuse of the content loaded into the cache.
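As a rough worked example (my numbers, not from the answer): with 8-byte reals and hypothetical tile sizes tnx = 32, tny = 8, tnz = 8, one tile of U holds 32*8*8*8 B = 16 KiB, so U plus U2 roughly fill a typical 32 KiB L1 data cache; in practice you would leave room for the stencil halo and tune the tile sizes by benchmarking.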
If you need to separate the directions, you can of course do that and still keep the tiling.
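For instance, here is a minimal sketch (mine, not from the answer) that applies the same tiling to the y-derivative from the question, keeping i innermost so the inner loop stays contiguous and SIMD-friendly:
!$omp parallel private(i,j,k,bi,bj,bk)
!$omp do schedule(runtime) collapse(3)
do bk = 1, N(3), tnz
  do bj = 3, N(2)-2, tny
    do bi = 1, N(1), tnx
      do k = bk, min(bk+tnz-1, N(3))
        do j = bj, min(bj+tny-1, N(2)-2)
          do i = bi, min(bi+tnx-1, N(1))   ! contiguous inner loop
            DF(i,j,k) = (F(i,j+1,k)-F(i,j-1,k))*B + (F(i,j-2,k)-F(i,j+2,k))*C
          end do
        end do
      end do
    end do
  end do
end do
!$omp end do
!$omp end parallel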
Upvotes: 3