braconnier

Reputation: 31

MPI Fortran program hangs beyond a certain number of MPI processes

I am working on an MPI application that hangs when it is launched with more than 2071 MPI processes. I managed to write a small reproducer:

program main
use mpi
integer :: ierr,rank
call mpi_init(ierr)
call mpi_comm_rank(MPI_COMM_WORLD,rank,ierr)
if (rank.eq.0) print *,'Start'
call test_func(ierr)
if (ierr.ne.0) call exit(ierr)
call mpi_finalize(ierr)
if (rank.eq.0) print *,'Stop'

contains

subroutine test_func(ierr)
integer, intent(out) :: ierr
real :: send,recv
integer :: i,j,status(MPI_STATUS_SIZE),mpi_rank,mpi_size,ires
character(len=10) :: procname
real(kind=8) :: t1,t2
ierr=0
call mpi_comm_size(MPI_COMM_WORLD,mpi_size,ierr)
call mpi_comm_rank(MPI_COMM_WORLD,mpi_rank,ierr)
call mpi_get_processor_name(procname, ires, ierr)
call mpi_barrier(MPI_COMM_WORLD,ierr)
t1 = mpi_wtime()
! Round-robin exchange: in iteration j, rank j receives one real from
! every other rank, while all other ranks send one real to rank j.
do j=0,mpi_size-1
  if (mpi_rank.eq.j) then
    do i=0,mpi_size-1
      if (i.eq.j) cycle
      call MPI_RECV(recv,1,MPI_REAL,i,0,MPI_COMM_WORLD,status,ierr)
      if (ierr.ne.0) return
      if (i.eq.mpi_size-1) print *,'Rank ',j,procname,' done'
    enddo
  else
    call MPI_SEND(send,1,MPI_REAL,j,0,MPI_COMM_WORLD,ierr)
    if (ierr.ne.0) return
  endif
enddo
call mpi_barrier(MPI_COMM_WORLD,ierr)
t2 = mpi_wtime()
if (mpi_rank.eq.0) print*,"time send/recv = ",t2-t1
end subroutine test_func
end program main

When I run this program with up to 2071 MPI processes it works, but with 2072 or more processes it hangs, as if there were deadlocks in the send/recv calls. The output when running the program with I_MPI_DEBUG=5 is

[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 48487 r30i0n0 {0,24}
...
[0] MPI startup(): 2070 34737 r30i4n14 {18,19,20,42,43,44}
[0] MPI startup(): I_MPI_CC=icc
[0] MPI startup(): I_MPI_CXX=icpc
[0] MPI startup(): I_MPI_FC=ifort
[0] MPI startup(): I_MPI_F90=ifort
[0] MPI startup(): I_MPI_F77=ifort
[0] MPI startup(): I_MPI_ROOT=/data_local/sw/intel/RHEL7/compilers_and_libraries_2020.4.304/linux/mpi
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=lsf
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_EXTRA_FILESYSTEM=1
[0] MPI startup(): I_MPI_EXTRA_FILESYSTEM_FORCE=lustre
[0] MPI startup(): I_MPI_DEBUG=5

Question 1: Is there an explanation for this behavior?

Note that if I replace the send/recv communication pattern with either a bcast one

do j=0,mpi_size-1
  if (mpi_rank.eq.j) then
    call MPI_BCAST(send,1,MPI_REAL,j,MPI_COMM_WORLD,ierr)
  else
    call MPI_BCAST(recv,1,MPI_REAL,j,MPI_COMM_WORLD,ierr)
  endif
  if (ierr.ne.0) return  
  print *,'Rank ',j,procname,' done'
enddo

or an allgather one

! note: recv must hold one real per rank here (size mpi_size), not a scalar
call MPI_ALLGATHER(MPI_IN_PLACE,0,MPI_DATATYPE_NULL,recv,1,MPI_REAL,MPI_COMM_WORLD,ierr)
print *,'Rank ',mpi_rank,procname,' done '

then the program runs (faster, of course) even with up to 4000 MPI processes (I did not try more). However, I cannot replace the send/recv pattern in the original application with the bcast or allgather ones.

Question 2: When I run the original application with 2064 MPI processes (86 nodes with 24 cores each), the memory consumed by MPI buffers is around 60 GB per node, and with 1032 MPI processes (43 nodes with 24 cores each) it is around 30 GB per node. Is there a way (environment variables, etc.) to reduce this memory consumption?

Many thanks in advance for your help

Thierry

Upvotes: 3

Views: 381

Answers (1)

Jonas Thies

Reputation: 58

I agree with the previous remarks that it looks like you start too many receives on a single process, and the MPI implementation may run out of some internal buffer space.

I would not recommend using Ssends or introducing barriers, though. Just replace all the MPI_Recv calls by MPI_Irecv and the MPI_Send calls by MPI_Isend, completing them later with MPI_Wait/MPI_Waitall. This does not introduce any synchronization and avoids deadlocks. You can then also change the order in which you post the sends and receives arbitrarily; for instance, you could pair sends and receives instead of posting all the receives in one big loop for each outer-loop iteration.
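
As a rough sketch of what that could look like for the loop in test_func (recvbuf, requests and nreq are names introduced here for illustration; mpi_rank, mpi_size, send and ierr are assumed to be the variables from the reproducer):

! Sketch only: non-blocking variant of the reproducer's loop.
real, allocatable    :: recvbuf(:)
integer, allocatable :: requests(:)
integer :: i, j, nreq

allocate(recvbuf(0:mpi_size-1), requests(mpi_size-1))
do j = 0, mpi_size-1
  if (mpi_rank.eq.j) then
    ! post all receives up front; each needs its own buffer element
    nreq = 0
    do i = 0, mpi_size-1
      if (i.eq.j) cycle
      nreq = nreq + 1
      call MPI_IRECV(recvbuf(i),1,MPI_REAL,i,0,MPI_COMM_WORLD,requests(nreq),ierr)
    enddo
    ! complete them in whatever order the messages arrive
    call MPI_WAITALL(nreq,requests,MPI_STATUSES_IGNORE,ierr)
    print *,'Rank ',j,' done'
  else
    ! the matching send; it could likewise become MPI_ISEND completed
    ! later with MPI_WAIT / MPI_WAITALL
    call MPI_SEND(send,1,MPI_REAL,j,0,MPI_COMM_WORLD,ierr)
  endif
enddo
deallocate(recvbuf, requests)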

Concerning your second question about memory consumption: it is probably the same issue: the number of receives you post in the inner loop grows linearly with the number of processes, and so does the amount of memory needed to handle these outstanding messages.

Finally, I also agree with the previous comments that this looks very much like a case for a collective MPI call. Which one exactly depends on the application code that you don't show here, but MPI_Alltoallv may be a good one to look at.
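
For the reproducer itself, where every rank exchanges exactly one real with every other rank, the whole double loop is essentially a single all-to-all. A minimal sketch (sendbuf and recvbuf are illustrative names, not from the question):

! Sketch only: collective equivalent of the reproducer's pattern.
real, allocatable :: sendbuf(:), recvbuf(:)
allocate(sendbuf(mpi_size), recvbuf(mpi_size))
sendbuf = send   ! same value sent to everybody, as in the reproducer
call MPI_ALLTOALL(sendbuf,1,MPI_REAL,recvbuf,1,MPI_REAL,MPI_COMM_WORLD,ierr)
! if counts differ per rank in the real application, MPI_ALLTOALLV with
! per-rank counts and displacements is the generalisation to look at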

Upvotes: 0
