braconnier

Reputation: 31

MPI Fortran program hangs beyond a certain number of MPI processes

I am working on an MPI application that hangs when it is launched with more than 2071 MPI processes. I managed to write a small reproducer:

program main
use mpi
integer :: ierr,rank
call mpi_init(ierr)
call mpi_comm_rank(MPI_COMM_WORLD,rank,ierr)
if (rank.eq.0) print *,'Start'
call test_func(ierr)
if (ierr.ne.0) call exit(ierr)
call mpi_finalize(ierr)
if (rank.eq.0) print *,'Stop'

contains

subroutine test_func(ierr)
integer, intent(out) :: ierr
real :: send,recv
integer :: i,j,status(MPI_STATUS_SIZE),mpi_rank,mpi_size,ires
character(len=10) :: procname
real(kind=8) :: t1,t2
ierr=0
call mpi_comm_size(MPI_COMM_WORLD,mpi_size,ierr)
call mpi_comm_rank(MPI_COMM_WORLD,mpi_rank,ierr)
call mpi_get_processor_name(procname, ires, ierr)
call mpi_barrier(MPI_COMM_WORLD,ierr)
t1 = mpi_wtime()
! Round-robin exchange: in iteration j, rank j receives one real from
! every other rank, while all other ranks send one real to rank j.
do j=0,mpi_size-1
  if (mpi_rank.eq.j) then
    do i=0,mpi_size-1
      if (i.eq.j) cycle
      call MPI_RECV(recv,1,MPI_REAL,i,0,MPI_COMM_WORLD,status,ierr)
      if (ierr.ne.0) return
      if (i.eq.mpi_size-1) print *,'Rank ',j,procname,' done'
    enddo
  else
    call MPI_SEND(send,1,MPI_REAL,j,0,MPI_COMM_WORLD,ierr)
    if (ierr.ne.0) return
  endif
enddo
call mpi_barrier(MPI_COMM_WORLD,ierr)
t2 = mpi_wtime()
if (mpi_rank.eq.0) print*,"time send/recv = ",t2-t1
end subroutine test_func
end program main

When I run this program with up to 2071 MPI processes it works, but with 2072 or more processes it hangs, as if there were deadlocks in the send/recv calls. The output when running the program with I_MPI_DEBUG=5 is

[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1-impi
[0] MPI startup(): libfabric provider: verbs;ofi_rxm
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 48487 r30i0n0 {0,24}
...
[0] MPI startup(): 2070 34737 r30i4n14 {18,19,20,42,43,44}
[0] MPI startup(): I_MPI_CC=icc
[0] MPI startup(): I_MPI_CXX=icpc
[0] MPI startup(): I_MPI_FC=ifort
[0] MPI startup(): I_MPI_F90=ifort
[0] MPI startup(): I_MPI_F77=ifort
[0] MPI startup(): I_MPI_ROOT=/data_local/sw/intel/RHEL7/compilers_and_libraries_2020.4.304/linux/mpi
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_RMK=lsf
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_EXTRA_FILESYSTEM=1
[0] MPI startup(): I_MPI_EXTRA_FILESYSTEM_FORCE=lustre
[0] MPI startup(): I_MPI_DEBUG=5

Question 1: Is there an explanation for this behavior?

Note that if I replace the send/recv communication pattern with either a bcast one

do j=0,mpi_size-1
  if (mpi_rank.eq.j) then
    call MPI_BCAST(send,1,MPI_REAL,j,MPI_COMM_WORLD,ierr)
  else
    call MPI_BCAST(recv,1,MPI_REAL,j,MPI_COMM_WORLD,ierr)
  endif
  if (ierr.ne.0) return  
  print *,'Rank ',j,procname,' done'
enddo

or an allgather one

! note: recv must hold one real per rank here (size mpi_size), not a scalar
call MPI_ALLGATHER(MPI_IN_PLACE,0,MPI_DATATYPE_NULL,recv,1,MPI_REAL,MPI_COMM_WORLD,ierr)
print *,'Rank ',mpi_rank,procname,' done '

then the program runs (faster, of course) even with up to 4000 MPI processes (I did not try more). However, I cannot replace the send/recv pattern in the original application with the bcast or allgather ones.

Question 2: When I run the original application with 2064 MPI processes (86 nodes with 24 cores each), the memory consumed by MPI buffers is around 60 GB per node, and with 1032 MPI processes (43 nodes with 24 cores each) it is around 30 GB per node. Is there a way (environment variables, etc.) to reduce this memory consumption?

Many thanks in advance for your help

Thierry

Upvotes: 3

Views: 381

Answers (1)

Jonas Thies

Reputation: 58

I agree with the previous remarks that it looks like you start too many receives on a single process, and the MPI implementation may run out of some internal buffer space.

I would not recommend using Ssends or introducing barriers, though. Just replace all the MPI_Recv calls by MPI_Irecv and the MPI_Send calls by MPI_Isend, completing them later with MPI_Wait/MPI_Waitall. This does not introduce any synchronization and avoids deadlocks. You can then also change the order in which you post the sends and receives arbitrarily; for instance, you could pair sends and receives instead of posting all the receives in one big loop for each outer-loop iteration.
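
As a rough sketch of what that could look like for the loop in test_func (recvbuf, requests and nreq are names introduced here for illustration; mpi_rank, mpi_size, send and ierr are assumed to be the variables from the reproducer):

! Sketch only: non-blocking variant of the reproducer's loop.
real, allocatable    :: recvbuf(:)
integer, allocatable :: requests(:)
integer :: i, j, nreq

allocate(recvbuf(0:mpi_size-1), requests(mpi_size-1))
do j = 0, mpi_size-1
  if (mpi_rank.eq.j) then
    ! post all receives up front; each needs its own buffer element
    nreq = 0
    do i = 0, mpi_size-1
      if (i.eq.j) cycle
      nreq = nreq + 1
      call MPI_IRECV(recvbuf(i),1,MPI_REAL,i,0,MPI_COMM_WORLD,requests(nreq),ierr)
    enddo
    ! complete them in whatever order the messages arrive
    call MPI_WAITALL(nreq,requests,MPI_STATUSES_IGNORE,ierr)
    print *,'Rank ',j,' done'
  else
    ! the matching send; it could likewise become MPI_ISEND completed
    ! later with MPI_WAIT / MPI_WAITALL
    call MPI_SEND(send,1,MPI_REAL,j,0,MPI_COMM_WORLD,ierr)
  endif
enddo
deallocate(recvbuf, requests)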

Concerning your second question about memory consumption: it is probably the same issue: the number of receives you post in the inner loop grows linearly with the number of processes, and so does the amount of memory needed to handle these outstanding messages.

Finally, I also agree with the previous comments that this looks very much like a case for a collective MPI call. Which one exactly depends on the application code that you don't show here, but MPI_Alltoallv may be a good one to look at.
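
For the reproducer itself, where every rank exchanges exactly one real with every other rank, the whole double loop is essentially a single all-to-all. A minimal sketch (sendbuf and recvbuf are illustrative names, not from the question):

! Sketch only: collective equivalent of the reproducer's pattern.
real, allocatable :: sendbuf(:), recvbuf(:)
allocate(sendbuf(mpi_size), recvbuf(mpi_size))
sendbuf = send   ! same value sent to everybody, as in the reproducer
call MPI_ALLTOALL(sendbuf,1,MPI_REAL,recvbuf,1,MPI_REAL,MPI_COMM_WORLD,ierr)
! if counts differ per rank in the real application, MPI_ALLTOALLV with
! per-rank counts and displacements is the generalisation to look at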

Upvotes: 0
