Reputation: 261
A small serial example with the same structure as my code is shown below.
PROGRAM MAIN
IMPLICIT NONE
INTEGER :: i, j
DOUBLE PRECISION :: en,ei,es
DOUBLE PRECISION :: ki(1000,2000), et(200),kn(2000)
OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
DO i = 1, 1000, 1
DO j = 1, 2000, 1
ki(i,j) = DBLE(i) + DBLE(j)
END DO
END DO
DO i = 1, 200, 1
en = 2.0d0/DBLE(200)*(i-1)-1.0d0
et(i) = en
es = 0.0d0
DO j = 1, 1000, 1
kn=ki(j,:)
CALL CAL(en,kn,ei)
es = es + ei
END DO
WRITE (UNIT=3, FMT=*) et(i), es
END DO
CLOSE(UNIT=3)
STOP
END PROGRAM MAIN
SUBROUTINE CAL (en,kn,ei)
IMPLICIT NONE
INTEGER :: i
DOUBLE PRECISION :: en, ei, gf,p
DOUBLE PRECISION :: kn(2000)
p = 3.14d0
ei = 0.0d0
DO i = 1, 2000, 1
gf = 1.0d0 / (en - kn(i) * p)
ei = ei + gf
END DO
RETURN
END SUBROUTINE CAL
I am running my code on a cluster where each node has 32 CPUs sharing 250 GB of memory. I can use at most 32 nodes.
Each time the inner loop finishes, one value has to be collected. After all outer loop iterations are done, there are 200 values in total. If the inner loop is executed by a single CPU, the run takes more than 3 days (more than 72 hours).
I want to parallelize both the inner loop and the outer loop. Could anyone suggest how to parallelize this code?
Can I use MPI for both the inner loop and the outer loop? If so, how do I distinguish the CPUs that execute the different loops (inner and outer)?
On the other hand, I have seen people mention hybrid MPI and OpenMP parallelization. Can I use MPI for the outer loop and OpenMP for the inner loop? If so, how do I collect one value each time the inner loop finishes, and all 200 values once the outer loop is done? And how do I distinguish the CPUs that execute the inner loop from those that execute the outer loop?
Alternatively, could anyone offer other suggestions for parallelizing the code and improving its efficiency? Thank you very much in advance.
Upvotes: 2
Views: 1189
Reputation: 190
As mentioned in the comments, a good answer will require a more detailed question. However, at first sight it seems that parallelizing the inner loop
DO j = 1, 1000, 1
kn=ki(j,:)
CALL CAL(en,kn,ei)
es = es + ei
END DO
should be enough to solve your problem, or at least be a good start. First of all, I guess that there is an error in the loop
DO i = 1, 1000, 1
DO j = 1, 2000, 1
ki(j,k) = DBLE(j) + DBLE(k)
END DO
END Do
since k is set to 0 and there is no element with index 0 (see your variable declaration). Also, ki is declared as a ki(1000,2000) array, while indexing it as ki(j,i) would require a (2000,1000) array. Besides these errors, I guess that ki should be calculated as
ki(i,j) = DBLE(j) + DBLE(i)
If so, I suggest the following solution:
PROGRAM MAIN
IMPLICIT NONE
INTEGER :: i, j, k,icr,icr0,icr1
DOUBLE PRECISION :: en,ei,es,timerRate
DOUBLE PRECISION :: ki(1000,2000), et(200),kn(2000)
INTEGER,PARAMETER:: nthreads=1
call system_clock(count_rate=icr)
timerRate=real(icr)
call system_clock(icr0)
call omp_set_num_threads(nthreads)
OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
DO i = 1, 1000, 1
DO j = 1, 2000, 1
ki(i,j) = DBLE(j) + DBLE(i)
END DO
END DO
DO i = 1, 200, 1
en = 2.0d0/DBLE(200)*(i-1)-1.0d0
et(i) = en
es = 0.0d0
!$OMP PARALLEL DO private(j,kn,ei) firstprivate(en) shared(ki) reduction(+:es)
DO j = 1, 1000, 1
kn=ki(j,:)
CALL CAL(en,kn,ei)
es = es + ei
END DO
!$OMP END PARALLEL DO
WRITE (UNIT=3, FMT=*) et(i), es
END DO
CLOSE(UNIT=3)
call system_clock(icr1)
write (*,*) (icr1-icr0)/timerRate ! return computing time
STOP
END PROGRAM MAIN
SUBROUTINE CAL (en,kn,ei)
IMPLICIT NONE
INTEGER :: i
DOUBLE PRECISION :: en, ei, gf,p
DOUBLE PRECISION :: kn(2000)
p = 3.14d0
ei = 0.0d0
DO i = 1, 2000, 1
gf = 1.0d0 / (en - kn(i) * p)
ei = ei + gf
END DO
RETURN
END SUBROUTINE CAL
I added some variables to measure the computing time ;-).
This solution runs in 5.14 s with nthreads=1 and in 2.75 s with nthreads=2. That does not quite halve the computing time, but it seems a good result for a first attempt. Unfortunately, this machine has a Core i3 processor, so I can't go beyond nthreads=2. I wonder how the code will behave with nthreads=16.
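To try other thread counts without recompiling, the hard-coded nthreads parameter can be dropped so that the OpenMP runtime takes the count from the OMP_NUM_THREADS environment variable. The following is only a minimal sketch of that idea: the program name and the dummy summation loop are placeholders of mine, not part of the code above, and omp_get_wtime is used here instead of system_clock.

```fortran
PROGRAM THREADS_TIMING
   USE omp_lib
   IMPLICIT NONE
   INTEGER :: j
   DOUBLE PRECISION :: t0, t1, s

   ! The thread count is taken from OMP_NUM_THREADS (or the runtime default).
   WRITE (*,*) 'threads available: ', omp_get_max_threads()

   s = 0.0d0
   t0 = omp_get_wtime()
   ! Placeholder workload (an assumption): replace with the real loop.
   !$OMP PARALLEL DO REDUCTION(+:s)
   DO j = 1, 100000000
      s = s + 1.0d0 / DBLE(j)
   END DO
   !$OMP END PARALLEL DO
   t1 = omp_get_wtime()

   WRITE (*,*) 'sum = ', s, ' elapsed (s) = ', t1 - t0
END PROGRAM THREADS_TIMING
```

With gfortran this could be built with the -fopenmp flag and run as, for example, OMP_NUM_THREADS=16 ./a.out on a machine with enough cores.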
Please let me know
I hope that this helps you.
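Regarding the hybrid MPI and OpenMP idea from the question, here is a rough sketch of how it might look; I have not run it on a cluster, so treat it as a starting point only. It assumes that each MPI rank can hold its own copy of ki, that the outer loop is dealt out round-robin over the ranks, and that a single MPI_Reduce at the end is enough to collect the 200 values on rank 0. The program name and the es_local/es_all arrays are names I introduced for illustration.

```fortran
PROGRAM HYBRID_SKETCH
   USE mpi
   IMPLICIT NONE
   INTEGER, PARAMETER :: nen = 200, nrow = 1000, ncol = 2000
   INTEGER :: i, j, ierr, rank, nprocs, provided
   DOUBLE PRECISION :: en, ei, es
   DOUBLE PRECISION :: es_local(nen), es_all(nen)
   DOUBLE PRECISION :: kn(ncol)
   DOUBLE PRECISION, ALLOCATABLE :: ki(:,:)

   ! Only the master thread of each rank makes MPI calls (FUNNELED).
   CALL MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
   CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   CALL MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   ! Assumption: every rank builds its own copy of ki (16 MB here).
   ALLOCATE (ki(nrow, ncol))
   DO j = 1, ncol
      DO i = 1, nrow
         ki(i,j) = DBLE(i) + DBLE(j)
      END DO
   END DO

   es_local = 0.0d0
   ! Outer loop over MPI ranks: rank r takes i = r+1, r+1+nprocs, ...
   DO i = rank + 1, nen, nprocs
      en = 2.0d0/DBLE(nen)*(i-1) - 1.0d0
      es = 0.0d0
      ! Inner loop over the OpenMP threads of this rank.
      !$OMP PARALLEL DO PRIVATE(j, kn, ei) REDUCTION(+:es)
      DO j = 1, nrow
         kn = ki(j,:)
         CALL CAL(en, kn, ei)
         es = es + ei
      END DO
      !$OMP END PARALLEL DO
      es_local(i) = es
   END DO

   ! Entries a rank did not compute are zero, so a sum assembles all values.
   CALL MPI_Reduce(es_local, es_all, nen, MPI_DOUBLE_PRECISION, &
                   MPI_SUM, 0, MPI_COMM_WORLD, ierr)

   ! Only rank 0 writes the collected results.
   IF (rank == 0) THEN
      OPEN (UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
      DO i = 1, nen
         WRITE (UNIT=3, FMT=*) 2.0d0/DBLE(nen)*(i-1) - 1.0d0, es_all(i)
      END DO
      CLOSE (UNIT=3)
   END IF

   CALL MPI_Finalize(ierr)
END PROGRAM HYBRID_SKETCH

SUBROUTINE CAL (en, kn, ei)
   IMPLICIT NONE
   INTEGER :: i
   DOUBLE PRECISION :: en, ei, gf, p
   DOUBLE PRECISION :: kn(2000)
   p = 3.14d0
   ei = 0.0d0
   DO i = 1, 2000
      gf = 1.0d0 / (en - kn(i) * p)
      ei = ei + gf
   END DO
END SUBROUTINE CAL
```

In this layout there is no need to tell "inner" CPUs from "outer" CPUs: the MPI rank decides which outer iterations a process handles, and the OpenMP threads inside that process share its inner loop. A typical launch would be one rank per node with OMP_NUM_THREADS set to the number of cores on the node, but the best split depends on the real code.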
Finally, a warning: the data-sharing attributes of the variables (private, firstprivate and shared) must be chosen carefully in the real code.
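One way to make that choice explicit is the DEFAULT(NONE) clause, which makes the compiler reject any variable used in the parallel region whose attribute has not been declared. The sketch below applies it to the same inner loop for a single value of en; the program name and the fixed en = -1.0d0 are assumptions for illustration only.

```fortran
PROGRAM DEFAULT_NONE_SKETCH
   IMPLICIT NONE
   INTEGER, PARAMETER :: nrow = 1000, ncol = 2000
   INTEGER :: i, j
   DOUBLE PRECISION :: en, ei, es
   DOUBLE PRECISION :: kn(ncol)
   DOUBLE PRECISION, ALLOCATABLE :: ki(:,:)

   ALLOCATE (ki(nrow, ncol))
   DO j = 1, ncol
      DO i = 1, nrow
         ki(i,j) = DBLE(i) + DBLE(j)
      END DO
   END DO

   en = -1.0d0   ! assumption: a single energy value for illustration
   es = 0.0d0
   ! DEFAULT(NONE): every variable used in the region must be listed.
   ! kn and ei are per-thread scratch, ki and en are only read, es is summed.
   !$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(j, kn, ei) SHARED(ki, en) &
   !$OMP REDUCTION(+:es)
   DO j = 1, nrow
      kn = ki(j,:)
      CALL CAL(en, kn, ei)
      es = es + ei
   END DO
   !$OMP END PARALLEL DO

   WRITE (*,*) en, es
END PROGRAM DEFAULT_NONE_SKETCH

SUBROUTINE CAL (en, kn, ei)
   IMPLICIT NONE
   INTEGER :: i
   DOUBLE PRECISION :: en, ei, gf, p
   DOUBLE PRECISION :: kn(2000)
   p = 3.14d0
   ei = 0.0d0
   DO i = 1, 2000
      gf = 1.0d0 / (en - kn(i) * p)
      ei = ei + gf
   END DO
END SUBROUTINE CAL
```

If a variable such as kn were left out of the clauses, the compiler would report it instead of silently sharing it between threads, which is the kind of mistake that otherwise shows up as wrong results only at run time.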
Upvotes: 3