Reputation: 11
I am a graduate student (master's) and I run my simulations with an in-house code that uses MPI. I previously used OpenMPI on a supercomputer we had access to; since it was shut down, I have been trying to switch to another supercomputer that has Intel MPI installed. The problem is that the same code which worked perfectly fine before now hits what looks like a memory leak after a set number of iterations (time steps). The code is relatively large and my knowledge of MPI is very basic, so it is proving very difficult to debug. I therefore installed OpenMPI on the new supercomputer, but it prints the following error message on execution and then terminates:
Invalid number of PE
Please check partitioning pattern or number of PE
NOTE: The error message is repeated once for each of the nodes used to run the case (here, 8). The code was compiled with mpif90 and -fopenmp for thread parallelisation.
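As far as I can tell, this message comes from the in-house code itself, not from OpenMPI. My guess is that it is printed by a start-up check along these lines (a hypothetical sketch, not the actual code; npx and npy are made-up names for the partitioning pattern), which would also explain why the message appears once per PE:

    program pe_check_sketch
       use mpi
       implicit none
       integer, parameter :: npx = 4, npy = 2   ! made-up 4 x 2 partitioning = 8 PEs
       integer :: nprocs, ierr

       call MPI_Init(ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       if (nprocs /= npx*npy) then
          ! Every rank runs this test, which is why the message is printed once per PE
          write(*,*) 'Invalid number of PE'
          write(*,*) 'Please check partitioning pattern or number of PE'
          call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
       end if

       call MPI_Finalize(ierr)
    end program pe_check_sketch

If the check really is of this form, then the number of processes I pass to mpirun simply does not match the partitioning set in the input files, and the OpenMPI build itself may be fine.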
There is, in fact, no guarantee that running under OpenMPI will avoid the memory leak, but I feel it is worth a shot, as the code ran perfectly fine under OpenMPI before.
PS: Under Intel MPI (compiled with mpiifort and -qopenmp), this is the error I got:
Abort(941211497) on node 16 (rank 16 in comm 0): Fatal error in PMPI_Isend: Unknown error class, error stack:
PMPI_Isend(152)...........: MPI_Isend(buf=0x2aba1cbc8060, count=4900, dtype=0x4c000829, dest=20, tag=0, MPI_COMM_WORLD, request=0x7ffec8586e5c) failed
MPID_Isend(662)...........:
MPID_isend_unsafe(282)....:
MPIDI_OFI_send_normal(305): failure occurred while allocating memory for a request object
Abort(203013993) on node 17 (rank 17 in comm 0): Fatal error in PMPI_Isend: Unknown error class, error stack:
PMPI_Isend(152)...........: MPI_Isend(buf=0x2b38c479c060, count=4900, dtype=0x4c000829, dest=21, tag=0, MPI_COMM_WORLD, request=0x7fffc20097dc) failed
MPID_Isend(662)...........:
MPID_isend_unsafe(282)....:
MPIDI_OFI_send_normal(305): failure occurred while allocating memory for a request object
[[email protected]] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:357): write error (Bad file descriptor)
[[email protected]] cmd_bcast_root (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:164): error sending cmd 15 to proxy
[[email protected]] send_abort_rank_downstream (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:557): unable to send response downstream
[[email protected]] control_cb (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1576): unable to send abort rank to downstreams
[[email protected]] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[[email protected]] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1962): error waiting for event
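From reading the stack, the "request object" seems to be the handle returned by MPI_Isend. Below is a minimal sketch (not my actual code; the ring-exchange pattern and all names are made up) of the kind of time-step loop that would exhaust request objects if the send requests were never completed:

    program isend_sketch
       use mpi
       implicit none
       integer :: ierr, rank, nprocs, step, sreq, rreq
       integer :: status(MPI_STATUS_SIZE)
       double precision :: sbuf(4900), rbuf(4900)   ! same count as in the error stack

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       sbuf = dble(rank)
       do step = 1, 10000                           ! stands in for the time-step loop
          call MPI_Irecv(rbuf, 4900, MPI_DOUBLE_PRECISION, mod(rank-1+nprocs, nprocs), &
                         0, MPI_COMM_WORLD, rreq, ierr)
          call MPI_Isend(sbuf, 4900, MPI_DOUBLE_PRECISION, mod(rank+1, nprocs), &
                         0, MPI_COMM_WORLD, sreq, ierr)
          call MPI_Wait(rreq, status, ierr)
          ! Without this wait (or an MPI_Request_free) on the send request, one
          ! request object leaks per time step; Intel MPI then aborts in
          ! MPIDI_OFI_send_normal once it can no longer allocate a new one.
          call MPI_Wait(sreq, status, ierr)
       end do

       call MPI_Finalize(ierr)
    end program isend_sketch

If our code posts MPI_Isend every time step without ever calling MPI_Wait/MPI_Waitall (or MPI_Request_free) on the resulting requests, that would match the failure appearing only after a set number of time steps.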
I will be happy to provide the code if somebody is willing to take a look at it. It is written in Fortran, with some of the functions written in C. My research progress has been completely halted by this problem, and nobody in my lab has enough experience with MPI to resolve it.
Upvotes: 1
Views: 325