How to parallelize the nested loop

Question

A small serial example, which has the same structure as my real code, is shown below.

PROGRAM MAIN
IMPLICIT NONE
INTEGER          :: i, j
DOUBLE PRECISION :: en,ei,es
DOUBLE PRECISION :: ki(1000,2000), et(200),kn(2000)
OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
DO i = 1, 1000, 1
   DO j = 1, 2000, 1
      ki(i,j) = DBLE(i) + DBLE(j)
   END DO
END DO
DO i = 1, 200, 1
   en = 2.0d0/DBLE(200)*(i-1)-1.0d0
   et(i) = en
   es = 0.0d0
   DO j = 1, 1000, 1
      kn=ki(j,:)
      CALL CAL(en,kn,ei)
      es = es + ei
   END DO
   WRITE (UNIT=3, FMT=*) et(i), es
END DO
CLOSE(UNIT=3)
STOP
END PROGRAM MAIN

SUBROUTINE CAL (en,kn,ei)
IMPLICIT NONE
INTEGER          :: i
DOUBLE PRECISION :: en, ei, gf,p
DOUBLE PRECISION :: kn(2000)
p = 3.14d0
ei = 0.0d0
DO i = 1, 2000, 1
   gf = 1.0d0 / (en - kn(i) * p)
   ei = ei + gf
END DO
RETURN
END SUBROUTINE CAL

I am running my code on a cluster where each node has 32 CPUs sharing 250 GB of memory. I can use at most 32 nodes.

Each time the inner loop finishes, there is one value to collect. After all outer iterations are done, there are 200 values in total. If the inner loop is executed by only one CPU, the run takes more than 3 days (more than 72 hours).

I want to parallelize both the inner loop and the outer loop. Could anyone suggest how to parallelize this code?

Can I use MPI for both the inner loop and the outer loop? If so, how do I distinguish the CPUs that execute the inner loop from the CPUs that execute the outer loop?
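
To make the MPI-only question concrete, here is a minimal, untested sketch of one possible approach: flatten the (outer, inner) pair into a single index and deal those indices out to the ranks, so no rank has to be tied to one particular loop. The names MPI_FLAT, idx, es_local, es_all, ne and nk are my own illustrative choices, and the sketch assumes every rank can afford its own copy of ki.

```fortran
PROGRAM MPI_FLAT
USE mpi
IMPLICIT NONE
INTEGER, PARAMETER :: ne = 200, nk = 1000   ! outer and inner trip counts
INTEGER          :: i, j, idx, rank, nprocs, ierr
DOUBLE PRECISION :: en, ei
DOUBLE PRECISION :: ki(1000,2000), kn(2000)
DOUBLE PRECISION :: es_local(ne), es_all(ne)

CALL MPI_INIT(ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

! Assumption: every rank builds its own full copy of ki (about 16 MB)
DO i = 1, nk
   DO j = 1, 2000
      ki(i,j) = DBLE(i) + DBLE(j)
   END DO
END DO

es_local = 0.0d0
! Flattened (i,j) space dealt out round-robin:
! rank r handles idx = r, r+nprocs, r+2*nprocs, ...
DO idx = rank, ne*nk - 1, nprocs
   i = idx / nk + 1          ! outer (energy) index, 1..ne
   j = MOD(idx, nk) + 1      ! inner (row of ki) index, 1..nk
   en = 2.0d0/DBLE(ne)*(i-1) - 1.0d0
   kn = ki(j,:)
   CALL CAL(en, kn, ei)
   es_local(i) = es_local(i) + ei   ! partial sum for energy i on this rank
END DO

! Add the partial sums of all ranks; rank 0 receives the 200 totals
CALL MPI_REDUCE(es_local, es_all, ne, MPI_DOUBLE_PRECISION, MPI_SUM, &
                0, MPI_COMM_WORLD, ierr)

IF (rank == 0) THEN
   OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
   DO i = 1, ne
      en = 2.0d0/DBLE(ne)*(i-1) - 1.0d0
      WRITE (UNIT=3, FMT=*) en, es_all(i)
   END DO
   CLOSE(UNIT=3)
END IF
CALL MPI_FINALIZE(ierr)
END PROGRAM MPI_FLAT

SUBROUTINE CAL (en,kn,ei)
IMPLICIT NONE
INTEGER          :: i
DOUBLE PRECISION :: en, ei, gf, p
DOUBLE PRECISION :: kn(2000)
p = 3.14d0
ei = 0.0d0
DO i = 1, 2000, 1
   gf = 1.0d0 / (en - kn(i) * p)
   ei = ei + gf
END DO
RETURN
END SUBROUTINE CAL
```

In this sketch only the rank number decides which (i,j) pairs a process works on, and only rank 0 writes the file. Because the additions happen in a different order than in the serial code, the sums may differ from the serial output in the last few digits.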

On the other hand, I have seen people mention hybrid MPI and OpenMP parallelization. Can I use MPI for the outer loop and OpenMP for the inner loop? If so, how do I collect the one value produced each time the inner loop finishes, and how do I collect all 200 values on one CPU after the outer loop is done? How do I distinguish the CPUs that execute the inner loop from those that execute the outer loop?
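
As a rough picture of the hybrid layout described above, here is a minimal, untested sketch: MPI ranks share the outer loop, the OpenMP threads inside each rank share the inner loop through a REDUCTION clause, and a single MPI_REDUCE gathers the 200 values on rank 0. The names HYBRID, es_local, es_all, provided, ne and nk are illustrative assumptions, not part of my real code.

```fortran
PROGRAM HYBRID
USE mpi
IMPLICIT NONE
INTEGER, PARAMETER :: ne = 200, nk = 1000   ! outer and inner trip counts
INTEGER          :: i, j, rank, nprocs, provided, ierr
DOUBLE PRECISION :: en, ei, es
DOUBLE PRECISION :: ki(1000,2000), kn(2000)
DOUBLE PRECISION :: es_local(ne), es_all(ne)

! FUNNELED: only the main thread of each rank makes MPI calls
CALL MPI_INIT_THREAD(MPI_THREAD_FUNNELED, provided, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

DO i = 1, nk
   DO j = 1, 2000
      ki(i,j) = DBLE(i) + DBLE(j)
   END DO
END DO

es_local = 0.0d0
! Outer loop: rank r takes i = r+1, r+1+nprocs, ...
DO i = rank + 1, ne, nprocs
   en = 2.0d0/DBLE(ne)*(i-1) - 1.0d0
   es = 0.0d0
   ! Inner loop: threads of this rank split j; each thread has its own
   ! kn and ei, and the per-thread es values are summed at the end
   !$OMP PARALLEL DO PRIVATE(j, kn, ei) REDUCTION(+:es)
   DO j = 1, nk
      kn = ki(j,:)
      CALL CAL(en, kn, ei)
      es = es + ei
   END DO
   !$OMP END PARALLEL DO
   es_local(i) = es          ! the one value produced by this inner loop
END DO

! Entries a rank did not compute stay 0, so a sum collects all 200 values
CALL MPI_REDUCE(es_local, es_all, ne, MPI_DOUBLE_PRECISION, MPI_SUM, &
                0, MPI_COMM_WORLD, ierr)

IF (rank == 0) THEN
   OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
   DO i = 1, ne
      en = 2.0d0/DBLE(ne)*(i-1) - 1.0d0
      WRITE (UNIT=3, FMT=*) en, es_all(i)
   END DO
   CLOSE(UNIT=3)
END IF
CALL MPI_FINALIZE(ierr)
END PROGRAM HYBRID

SUBROUTINE CAL (en,kn,ei)
IMPLICIT NONE
INTEGER          :: i
DOUBLE PRECISION :: en, ei, gf, p
DOUBLE PRECISION :: kn(2000)
p = 3.14d0
ei = 0.0d0
DO i = 1, 2000, 1
   gf = 1.0d0 / (en - kn(i) * p)
   ei = ei + gf
END DO
RETURN
END SUBROUTINE CAL
```

With this layout the inner-loop value never has to be sent anywhere: the OpenMP reduction leaves it in es on the rank that owns iteration i, and the final MPI_REDUCE is the only communication. I assume it would be built with the cluster's MPI compiler wrapper plus the compiler's OpenMP flag (for example -fopenmp with gfortran) and run with one rank per node and 32 threads per rank, but I have not verified that on my cluster.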

Alternatively, does anyone have other suggestions for parallelizing the code and improving its efficiency? Thank you very much in advance.
