user13132640

mpi4py MPI_INIT failure, using Python 3.13 & OpenMPI 4.1.7

Here's an example Python script I'm trying to run:

import mpi4py.futures as mp

def some_maths(x, y):
    return (x**2) / (1 + y)

if __name__ == '__main__':
    multiargs = [(1, 5), (2, 6), (3, 7), (5, 8), (7, 9), (9, 10)]

    # Parallel execution
    _PoolExecutor = mp.MPIPoolExecutor
    with _PoolExecutor(max_workers=len(multiargs)) as p:
        out = p.starmap(some_maths, multiargs)

    for r in out:
        print(r)

We are upgrading from Python 3.10 to 3.13. In 3.10, with mpi4py==3.1.5, this runs fine. In 3.13, regardless of whether I use mpi4py==3.1.5 or the newer 4.0.1, I get an MPI communication error:

MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.

There are several versions of OpenMPI installed on the system, so I checked which one each environment uses (via mpi4py.MPI.Get_version()). Under Python 3.10, where the script works, mpi4py is linked against OpenMPI 2.1.1; in the newer Python installation, it is linked against OpenMPI 4.1.7.
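For reference, the per-environment check was just a couple of lines; as a sketch (the helper name and the ImportError guard are mine, so it also runs in environments without mpi4py):

```python
def mpi_library_info():
    """Report which MPI library this interpreter's mpi4py is built against.

    Returns (mpi_standard_version, first_line_of_library_banner),
    or None when mpi4py is not importable in this environment.
    """
    try:
        from mpi4py import MPI
    except ImportError:
        return None
    # Get_version() gives the MPI standard version, e.g. (3, 1);
    # Get_library_version() gives the implementation banner, e.g. "Open MPI v4.1.7..."
    return MPI.Get_version(), MPI.Get_library_version().splitlines()[0]

print(mpi_library_info())
```

Running this in each Python installation makes the OpenMPI mismatch between the two environments obvious.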

I enabled verbose output as the third recommendation suggests, but the result wasn't helpful, at least to me:

 mca: base: components_register: registering framework btl components
 mca: base: components_register: found loaded component self
 mca: base: components_register: component self register function successful
 mca: base: components_open: opening btl components
 mca: base: components_open: found loaded component self
 mca: base: components_open: component self open function successful
 select: initializing btl component self
 select: init of component self returned success
... above repeated several times ... 
 mca: bml: Using self btl for send to [[43610,2],0] on node base
 mca: bml: Using self btl for send to [[43610,2],1] on node base
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
  *** An error occurred in MPI_Init_thread
  *** reported by process [2858024962,1]
  *** on a NULL communicator
  *** Unknown error
  *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
  ***    and potentially your MPI job)
  *** An error occurred in MPI_Init_thread
  *** reported by process [2858024962,0]
  *** on a NULL communicator
  *** Unknown error
  *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
  ***    and potentially your MPI job)

My initial thought is that I could just reinstall mpi4py for my 3.13 installation and specify at install time that it should build against the older OpenMPI. It seems this is done via:

$ env MPICC=/path/to/mpicc python -m pip install mpi4py

First, is this a reasonable approach? Second, is there a simple way to find the mpicc path corresponding to the older version? I've found a folder on the system for this version, but there is no mpicc in it.
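In case it's relevant, here's roughly how I've been hunting for candidate mpicc binaries (a sketch; the glob patterns are guesses at common install prefixes, not paths I know exist on this machine):

```python
import glob
import shutil

def find_mpicc_candidates():
    """Collect candidate mpicc binaries: the one on PATH, plus any found
    under common Open MPI install prefixes (patterns are guesses)."""
    candidates = []
    on_path = shutil.which("mpicc")  # whatever "mpicc" resolves to right now
    if on_path:
        candidates.append(on_path)
    # Typical locations for distro-packaged or hand-built Open MPI installs
    for pattern in ("/usr/lib64/openmpi*/bin/mpicc",
                    "/usr/lib/openmpi*/bin/mpicc",
                    "/opt/openmpi*/bin/mpicc",
                    "/usr/local/openmpi*/bin/mpicc"):
        candidates.extend(glob.glob(pattern))
    return sorted(set(candidates))

print(find_mpicc_candidates())
```

Each candidate can then be checked with `<path>/mpicc --version` (or `mpicc --showme` for Open MPI) to see which library version it wraps.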

Finally, any other troubleshooting steps I should try here?
