Reputation: 131

Is it possible to optimize these fortran loops?

Here is my problem: I have a fortran code with a certain amount of nested loops and first I wanted to know if it's possible to optimize (rearranging) them in order to get a time gain? Second I wonder if I could use OpenMP to optimize them?

I have seen a lot of posts about nested do loops in fortran and how to optimize them but I didn't find one example that is suited to mine. I have also searched about OpenMP for nested do loops in fortran but I'm level 0 in OpenMP and it's difficult for me to know how to use it in my case.

Here are two very similar examples of loops that I have, first:

do p=1,N
  do q=1,N

    do ab=1,nVV

      cd = 0
      do c=nO+1,N
        do d=c+1,N

          cd = cd + 1
          A(p,q,ab) = A(p,q,ab) + (B(p,q,c,d) - B(p,q,d,c))*C(cd,ab)

        end do
      end do
      
      kl = 0
        do k=1,nO
          do l=k+1,nO

            kl = kl + 1
            A(p,q,ab) = A(p,q,ab) + (B(p,q,k,l) - B(p,q,l,k))*D(kl,ab)

            end do
          end do

        end do

       do ij=1,nOO
         
         cd = 0
         do c=nO+1,N
           do d=c+1,N

             cd = cd + 1
              E(p,q,ij) = E(p,q,ij) + (B(p,q,c,d) - B(p,q,d,c))*F(cd,ij)

            end do
          end do


          kl = 0
          do k=1,nO
            do l=k+1,nO

              kl = kl + 1
              E(p,q,ij) = E(p,q,ij) + (B(p,q,k,l) - B(p,q,l,k))*G(kl,ij)

            end do
          end do

        end do

      end do
    end do

and the other one is:

    do p=1,N
      do q=1,N

        do ab=1,nVV

          cd = 0
          do c=nO+1,N
            do d=nO+1,N

              cd = cd + 1
              A(p,q,ab) = A(p,q,ab) + B(p,q,c,d)*C(cd,ab)

            end do
          end do

          kl = 0
          do k=1,nO
            do l=1,nO

              kl = kl + 1
              A(p,q,ab) = A(p,q,ab) + B(p,q,k,l)*D(kl,ab)

            end do
          end do

        end do

        do ij=1,nOO

          cd = 0
          do c=nO+1,N
            do d=nO+1,N

              cd = cd + 1
              E(p,q,ij) = E(p,q,ij) + B(p,q,c,d)*F(cd,ij)

            end do
          end do

          kl = 0
          do k=1,nO
            do l=1,nO

              kl = kl + 1
              E(p,q,ij) = E(p,q,ij) + B(p,q,k,l)*G(kl,ij)

            end do
          end do

        end do

      end do
    end do

The very small difference between the two examples is mainly in the indices of the loops. I don't know if you need more info about the different integers in the loops but you have in general: nO < nOO < N < nVV. So I don't know if it's possible to optimize these loops and/or possibly put them in a way that will facilitate the use of OpenMP (I don't know yet if I will use OpenMP, it will depend on how much I can gain by optimizing the loops without it).

I already tried to rearrange the loops in different ways without any success (no time gain) and I also tried a little bit of OpenMP but I don't know much about it, so again no success.

Upvotes: 1

Answers (2)

IPribec

Reputation: 236

It would be helpful if you could describe the operations you'd like to perform using tensor notation and the Einstein summation rule. I have the feeling the code could be written much more succinctly using something like np.einsum in NumPy.

For the second block of loop nests (the ones where you iterate across a square sub-section of B as opposed to a triangle) you could try to introduce some sub-programs or primitives from which the full solution is built.

Working from the bottom up, you start with a simple sum of two matrices.

!
! a_ij := a_ij + beta * b_ij 
!
pure subroutine apb(A,B,beta)
   real(dp), intent(inout) :: A(:,:)
   real(dp), intent(in) :: B(:,:)
   real(dp), intent(in) :: beta
   A = A + beta*B
end subroutine

(for first code block in the original post, you would substitute this primitive with one that only updates the upper/lower triangle of the matrix)

One step higher is a tensor contraction

!
! a_ij := a_ij + b_ijkl c_kl
!
pure subroutine reduce_b(A,B,C)
   real(dp), intent(inout) :: A(:,:)
   real(dp), intent(in) :: B(:,:,:,:)
   real(dp), intent(in) :: C(:,:)

   integer :: k, l

   do l = 1, size(B,4)
      do k = 1, size(B,3)

         call apb( A, B(:,:,k,l), C(k,l) )
   
      end do
   end do

end subroutine

Note the dimensions of C must match the last two dimensions of B. (In the original loop nest above, the storage order of C is swapped (i.e. c_lk instead of c_kl.)

Working our way upward, we have the contractions with two different sub-blocks of B, moreover A, C, and D have an additional outer dimension:

!
! A_n := A_n  +  B1_cd C_cdn +  B2_kl D_kln
!
!   The elements of  A_n    are  a_ijn
!   The elements of  B1_cd  are  B1_ijcd
!   The elements of  B2_kl  are  B2_ijkl
!
subroutine abcd(A,B1,C,B2,D)
   real(dp), intent(inout), contiguous :: A(:,:,:)
   real(dp), intent(in) :: B1(:,:,:,:)
   real(dp), intent(in) :: B2(:,:,:,:)
   real(dp), intent(in), contiguous, target :: C(:,:), D(:,:)

   real(dp), pointer :: p_C(:,:,:) => null()
   real(dp), pointer :: p_D(:,:,:) => null()

   integer :: k
   integer :: nc, nd

   nc = size(B1,3)*size(B1,4)
   nd = size(B2,3)*size(B2,4)

   if (nc /= size(C,1)) then
      error stop "FATAL ERROR: Dimension mismatch between B1 and C"
   end if

   if (nd /= size(D,1)) then
      error stop "FATAL ERROR: Dimension mismatch between B2 and D"
   end if

   ! Pointer remapping of arrays C and D to rank-3
   p_C(1:size(B1,3),1:size(B1,4),1:size(C,2)) => C
   p_D(1:size(B2,3),1:size(B2,4),1:size(D,2)) => D

   !$omp parallel do default(private) shared(A,B1,p_C,B2,p_D)
   do k = 1, size(A,3)

      call reduce_b( A(:,:,k), B1, p_C(:,:,k))
      call reduce_b( A(:,:,k), B2, p_D(:,:,k))

   end do
   !$omp end parallel do

end subroutine

Finally, we reach the main level where we select the subblocks of B

program doit

   use transform, only: abcd, dp

   implicit none

   ! n0  [2,10]
   !
   integer, parameter :: n0  = 6
   integer, parameter :: n00 = n0*n0

   integer, parameter :: N, nVV

   real(dp), allocatable :: A(:,:,:), B(:,:,:,:), C(:,:), D(:,:)

   ! N   [100,200]
   !
   read(*,*) N
   nVV = (N - n0)**2

   allocate(A(N,N,nVV))
   allocate(B(N,N,N,N))
   allocate(C(nVV,nVV))
   allocate(D(n00,nVV))

   print *, "Memory occupied (MB): ", & 
      real(sizeof(A) + sizeof(B) + sizeof(C) + sizeof(D),dp) / 1024._dp**2

   A = 0
   call random_number(B)
   call random_number(C)
   call random_number(D)

   call abcd(A=A, &
             B1=B(:,:,n0+1:N,n0+1:N), &
             B2=B(:,:,1:n0,1:n0), &
             C=C, &
             D=D)

   deallocate(A,B,C,D)

end program

Similar to the answer by PierU, parallelization is on the outermost loop. On my PC, for N = 50, this re-engineered routine is about 8 times faster when executed serially. With OpenMP on 4 threads the factor is 20. For N = 100 and I got tired of waiting for the original code; the re-engineered version on 4 threads took about 3 minutes.

The full code I used for testing, configurable via environment variables (ORIG=<0|1> N=100 ./abcd), is available here: https://gist.github.com/ivan-pi/385b3ae241e517381eb5cf84f119423d

With more fine-tuning it should be possible to bring the numbers down even further. Even better performance could be sought with a specialized library like cuTENSOR (also used under the hood of Fortran intrinsics as explained in Bringing Tensor Cores to Standard Fortran or a tool like the Tensor Contraction Engine.

One last thing I found odd was that large parts of B are un-unused. The sub sections B(:,:,1:n0,n0+1:N) and B(:,:,n0+1:N,1:n0) appear to be wasted space.

Upvotes: 1

PierU

Reputation: 2688

From the initial comments it may appear that at least in some cases you may be using more memory than the available RAM, which means you may be using the swap file, with all the bad consequences on the performances. To fix this, you have to either install more RAM if possible, or deeply reorganize your code to not store the full B array (by far the largest one) at once (again, if possible).

Now, let's assume that you have enough RAM. As I wrote in the comments, the access pattern on the B array is far from optimal, as the inner loops correspond to the last indeces of B, which can result in many cache misses (all the more given the the size of B). Changing the loop order if possible is a way to go.

Just looking at your first example, I am focusing on the computation of the array A (the computation of the array E looks completely independent of A, so it can be processed separately):

!! test it at first without OpenMP
!!$OMP PARALLEL DO PRIVATE(cd,c,d,kl,k,l)
do ab=1,nVV
   cd = 0
   do c=nO+1,N
      do d=c+1,N
         cd = cd + 1         
         A(:,:,ab) = A(:,:,ab) + (B(:,:,c,d) - B(:,:,d,c))*C(cd,ab)
      end do
   end do
   kl = 0
   do k=1,nO
      do l=k+1,nO
         kl = kl + 1
         A(:,:,ab) = A(:,:,ab) + (B(:,:,k,l) - B(:,:,l,k))*D(kl,ab)
      end do
   end do
end do  
!!$OMP END PARALLEL DO

What I did:

moved the loops on p and q from outer to inner positions (it's not always as easy than it is here)
replaced them with array syntax (no performance gain to expect, just a code easier to read)

Now the inner loops (abstracted by the array syntax) tackle contiguous elements in memory, which is much better for the performances. The code is even ready for OpenMP multithreading on the (now) outer loop.

EDIT/Hint

Fortran stores the arrays in "column-major order", that is when incrementing the first index one accesses contiguous elements in memory. In C the arrays are stored in "row-major order", that is when incrementing the last index one accesses contiguous elements in memory. So a general rule is to have the inner loops on the first indeces (and the opposite in C).

Upvotes: 3

Is it possible to optimize these fortran loops?

Answers (2)

EDIT/Hint

Related Questions