Reputation: 131
Here is my problem: I have a fortran code with a certain amount of nested loops and first I wanted to know if it's possible to optimize (rearranging) them in order to get a time gain? Second I wonder if I could use OpenMP to optimize them?
I have seen a lot of posts about nested do loops in fortran and how to optimize them but I didn't find one example that is suited to mine. I have also searched about OpenMP for nested do loops in fortran but I'm level 0 in OpenMP and it's difficult for me to know how to use it in my case.
Here are two very similar examples of loops that I have, first:
do p=1,N
do q=1,N
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
A(p,q,ab) = A(p,q,ab) + (B(p,q,c,d) - B(p,q,d,c))*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
A(p,q,ab) = A(p,q,ab) + (B(p,q,k,l) - B(p,q,l,k))*D(kl,ab)
end do
end do
end do
do ij=1,nOO
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
E(p,q,ij) = E(p,q,ij) + (B(p,q,c,d) - B(p,q,d,c))*F(cd,ij)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
E(p,q,ij) = E(p,q,ij) + (B(p,q,k,l) - B(p,q,l,k))*G(kl,ij)
end do
end do
end do
end do
end do
and the other one is:
do p=1,N
do q=1,N
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=nO+1,N
cd = cd + 1
A(p,q,ab) = A(p,q,ab) + B(p,q,c,d)*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=1,nO
kl = kl + 1
A(p,q,ab) = A(p,q,ab) + B(p,q,k,l)*D(kl,ab)
end do
end do
end do
do ij=1,nOO
cd = 0
do c=nO+1,N
do d=nO+1,N
cd = cd + 1
E(p,q,ij) = E(p,q,ij) + B(p,q,c,d)*F(cd,ij)
end do
end do
kl = 0
do k=1,nO
do l=1,nO
kl = kl + 1
E(p,q,ij) = E(p,q,ij) + B(p,q,k,l)*G(kl,ij)
end do
end do
end do
end do
end do
The very small difference between the two examples is mainly in the indices of the loops. I don't know if you need more info about the different integers in the loops but you have in general: nO < nOO < N < nVV. So I don't know if it's possible to optimize these loops and/or possibly put them in a way that will facilitate the use of OpenMP (I don't know yet if I will use OpenMP, it will depend on how much I can gain by optimizing the loops without it).
I already tried to rearrange the loops in different ways without any success (no time gain) and I also tried a little bit of OpenMP but I don't know much about it, so again no success.
Upvotes: 1
Views: 427
Reputation: 236
It would be helpful if you could describe the operations you'd like to perform using tensor notation and the Einstein summation rule. I have the feeling the code could be written much more succinctly using something like np.einsum
in NumPy.
For the second block of loop nests (the ones where you iterate across a square sub-section of B as opposed to a triangle) you could try to introduce some sub-programs or primitives from which the full solution is built.
Working from the bottom up, you start with a simple sum of two matrices.
!
! a_ij := a_ij + beta * b_ij
!
pure subroutine apb(A,B,beta)
real(dp), intent(inout) :: A(:,:)
real(dp), intent(in) :: B(:,:)
real(dp), intent(in) :: beta
A = A + beta*B
end subroutine
(for first code block in the original post, you would substitute this primitive with one that only updates the upper/lower triangle of the matrix)
One step higher is a tensor contraction
!
! a_ij := a_ij + b_ijkl c_kl
!
pure subroutine reduce_b(A,B,C)
real(dp), intent(inout) :: A(:,:)
real(dp), intent(in) :: B(:,:,:,:)
real(dp), intent(in) :: C(:,:)
integer :: k, l
do l = 1, size(B,4)
do k = 1, size(B,3)
call apb( A, B(:,:,k,l), C(k,l) )
end do
end do
end subroutine
Note the dimensions of C
must match the last two dimensions of B
. (In the original loop nest above, the storage order of C is swapped (i.e. c_lk
instead of c_kl
.)
Working our way upward, we have the contractions with two different sub-blocks of B
, moreover A
, C
, and D
have an additional outer dimension:
!
! A_n := A_n + B1_cd C_cdn + B2_kl D_kln
!
! The elements of A_n are a_ijn
! The elements of B1_cd are B1_ijcd
! The elements of B2_kl are B2_ijkl
!
subroutine abcd(A,B1,C,B2,D)
real(dp), intent(inout), contiguous :: A(:,:,:)
real(dp), intent(in) :: B1(:,:,:,:)
real(dp), intent(in) :: B2(:,:,:,:)
real(dp), intent(in), contiguous, target :: C(:,:), D(:,:)
real(dp), pointer :: p_C(:,:,:) => null()
real(dp), pointer :: p_D(:,:,:) => null()
integer :: k
integer :: nc, nd
nc = size(B1,3)*size(B1,4)
nd = size(B2,3)*size(B2,4)
if (nc /= size(C,1)) then
error stop "FATAL ERROR: Dimension mismatch between B1 and C"
end if
if (nd /= size(D,1)) then
error stop "FATAL ERROR: Dimension mismatch between B2 and D"
end if
! Pointer remapping of arrays C and D to rank-3
p_C(1:size(B1,3),1:size(B1,4),1:size(C,2)) => C
p_D(1:size(B2,3),1:size(B2,4),1:size(D,2)) => D
!$omp parallel do default(private) shared(A,B1,p_C,B2,p_D)
do k = 1, size(A,3)
call reduce_b( A(:,:,k), B1, p_C(:,:,k))
call reduce_b( A(:,:,k), B2, p_D(:,:,k))
end do
!$omp end parallel do
end subroutine
Finally, we reach the main level where we select the subblocks of B
program doit
use transform, only: abcd, dp
implicit none
! n0 [2,10]
!
integer, parameter :: n0 = 6
integer, parameter :: n00 = n0*n0
integer, parameter :: N, nVV
real(dp), allocatable :: A(:,:,:), B(:,:,:,:), C(:,:), D(:,:)
! N [100,200]
!
read(*,*) N
nVV = (N - n0)**2
allocate(A(N,N,nVV))
allocate(B(N,N,N,N))
allocate(C(nVV,nVV))
allocate(D(n00,nVV))
print *, "Memory occupied (MB): ", &
real(sizeof(A) + sizeof(B) + sizeof(C) + sizeof(D),dp) / 1024._dp**2
A = 0
call random_number(B)
call random_number(C)
call random_number(D)
call abcd(A=A, &
B1=B(:,:,n0+1:N,n0+1:N), &
B2=B(:,:,1:n0,1:n0), &
C=C, &
D=D)
deallocate(A,B,C,D)
end program
Similar to the answer by PierU, parallelization is on the outermost loop. On my PC, for N = 50, this re-engineered routine is about 8 times faster when executed serially. With OpenMP on 4 threads the factor is 20. For N = 100 and I got tired of waiting for the original code; the re-engineered version on 4 threads took about 3 minutes.
The full code I used for testing, configurable via environment variables (ORIG=<0|1> N=100 ./abcd
), is available here: https://gist.github.com/ivan-pi/385b3ae241e517381eb5cf84f119423d
With more fine-tuning it should be possible to bring the numbers down even further. Even better performance could be sought with a specialized library like cuTENSOR (also used under the hood of Fortran intrinsics as explained in Bringing Tensor Cores to Standard Fortran or a tool like the Tensor Contraction Engine.
One last thing I found odd was that large parts of B are un-unused. The sub sections B(:,:,1:n0,n0+1:N)
and B(:,:,n0+1:N,1:n0)
appear to be wasted space.
Upvotes: 1
Reputation: 2688
From the initial comments it may appear that at least in some cases you may be using more memory than the available RAM, which means you may be using the swap file, with all the bad consequences on the performances. To fix this, you have to either install more RAM if possible, or deeply reorganize your code to not store the full B array (by far the largest one) at once (again, if possible).
Now, let's assume that you have enough RAM. As I wrote in the comments, the access pattern on the B array is far from optimal, as the inner loops correspond to the last indeces of B, which can result in many cache misses (all the more given the the size of B). Changing the loop order if possible is a way to go.
Just looking at your first example, I am focusing on the computation of the array A (the computation of the array E looks completely independent of A, so it can be processed separately):
!! test it at first without OpenMP
!!$OMP PARALLEL DO PRIVATE(cd,c,d,kl,k,l)
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
A(:,:,ab) = A(:,:,ab) + (B(:,:,c,d) - B(:,:,d,c))*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
A(:,:,ab) = A(:,:,ab) + (B(:,:,k,l) - B(:,:,l,k))*D(kl,ab)
end do
end do
end do
!!$OMP END PARALLEL DO
What I did:
Now the inner loops (abstracted by the array syntax) tackle contiguous elements in memory, which is much better for the performances. The code is even ready for OpenMP multithreading on the (now) outer loop.
Fortran stores the arrays in "column-major order", that is when incrementing the first index one accesses contiguous elements in memory. In C the arrays are stored in "row-major order", that is when incrementing the last index one accesses contiguous elements in memory. So a general rule is to have the inner loops on the first indeces (and the opposite in C).
Upvotes: 3