oxedions
oxedions

Reputation: 61

Optimizing a loop : huge arrays operations

I am doing huge calculations (derivatives here, but look similar to images operations) on arrays that do not fit in cache, meaning the CPU have to load parts in the cache, calculate, then load another part, etc. But because of the shape of calculations, some data are load, unload and reload multiple times. I was wondering if there is a way to optimize this. I am already using SIMD instructions using the compiler optimization (GCC and Intel).

This is Fortran calculations, but it's similar to C/C++, the memory order is just inverted and arrays use () instead of []. for is replaced by do.

On x axe:

   do k=1,N(3)
      do j=1,N(2)
         do i=3,N(1)
            DF(i,j,k)=(F(i+1,j,k)-F(i-1,j,k))*B+(F(i-2,j,k)-F(i+2,j,k))*C
         end do
      end do
   end do

On y axe:

   do k=1,N(3)
      do j=1,N(2)
         do i=3,N(1)
            DF(i,j,k)=(F(i,j+1,k)-F(i,j-1,k))*B+(F(i,j-2,k)-F(i,j+2,k))*C
         end do
      end do
   end do

on z axe:

   do k=1,N(3)
      do j=1,N(2)
         do i=3,N(1)
            DF(i,j,k)=(F(i,j,k+1)-F(i,j,k-1))*B+(F(i,j,k-2)-F(i,j,k+2))*C
         end do
      end do
   end do

First derivative on axe x is OK because memory is read continuously. Derivatives on y and z axes are not continuous.

And the worst calculation I have combine all axes (This is a Laplacian operator):

 do k=1,N(3)
    do j=1,N(2)
       do i=3,N(1)
          V(i,j,k) = M(i,j,k,1) * p(i,j,k) &
                 & + M(i,j,k,2) * p(i-1,j,k) &
                 & + M(i,j,k,3) * p(i+1,j,k) &
                 & + M(i,j,k,4) * p(i,j-1,k) &
                 & + M(i,j,k,5) * p(i,j+1,k) &
                 & + M(i,j,k,6) * p(i,j,k-1) &
                 & + M(i,j,k,7) * p(i,j,k+1)
       end do
    end do
 end do

Note that compilers do not understand the last operation (Laplacian). To use SIMD (vectorized calculations), I need to split the operation like this, which gives a 2.5x speedup:

 do k=1,N(3)
    do j=1,N(2)
       do i=3,N(1)
          V(i,j,k) = M(i,j,k,1) * p(i,j,k) &
                 & + M(i,j,k,2) * p(i-1,j,k) &
                 & + M(i,j,k,3) * p(i+1,j,k)
       end do
    end do
 end do
 do k=1,N(3)
    do j=1,N(2)
       do i=3,N(1)
          V(i,j,k) = V(i,j,k) + &
                 & + M(i,j,k,4) * p(i,j-1,k) &
                 & + M(i,j,k,5) * p(i,j+1,k)
       end do
    end do
 end do
 do k=1,N(3)
    do j=1,N(2)
       do i=3,N(1)
          V(i,j,k) = V(i,j,k) + &
                 & + M(i,j,k,6) * p(i,j,k-1) &
                 & + M(i,j,k,7) * p(i,j,k+1)
       end do
    end do
 end do

Maybe using SIMD I already reached maximum speed, but because these calculations takes days, even with MPI and more than 1024 CPU, reducing the time of calculations, even of 20% would be a great step! Does anyone of you have ideas on how to optimize this?

Upvotes: 3

Views: 142

Answers (1)

When you use 3D stencils and you reference elements like i,j,k-1, i,j,k+1 the linear order in which you go through the array will not be optimal. The cache efficiency can be increased by loop tiling.

In my code I use

!$omp parallel private(i,j,k,bi,bj,bk)
!$omp do schedule(runtime) collapse(3)
do bk = 1, Unz, tnz
 do bj = 1, Uny, tny
  do bi = 1, Unx, tnx
   do k = bk, min(bk+tnz-1,Unz)
    do j = bj, min(bj+tny-1,Uny)
     do i = bi, min(bi+tnx-1,Unx)
         U2 (i,j,k) = U2(i,j,k) + &
           (U(i+1,j,k)-U(i,j,k)) * ...
         U2(i,j,k) = U2(i,j,k) - &
            (U(i,j,k)-U(i-1,j,k)) * ...
         U2(i,j,k) = U2(i,j,k) + &
            (U(i,j+1,k)-U(i,j,k)) * ...
         U2(i,j,k) = U2(i,j,k) - &
            (U(i,j,k)-U(i,j-1,k)) * ...
         U2(i,j,k) = U2(i,j,k) + &
            (U(i,j,k+1)-U(i,j,k)) * ...
         U2(i,j,k) = U2(i,j,k) - &
            (U(i,j,k)-U(i,j,k-1)) * ...
     end do
    end do
   end do
  end do
 end do
end do
!$omp end do

where tnx, tny, tnz are the sizes of a tile in which you iterate in the i,j,k order. The size must be set to be close to the L1 cache. This will increase reuse of the content loaded into the cache.

If you need to separate the directions, you can of course do that and still keep the tiling.

Upvotes: 3

Related Questions