magnanimousllamacopter

Reputation: 456

mpirun segmentation fault whenever I use a hostfile

I am trying to run a program that I wrote on two machines using MPI. It runs perfectly fine on 4 cores on the local machine when launched with mpirun. I have already configured ssh so that the local machine can log into the remote machine without a password. Whenever I run mpirun and specify a hostfile, I get a segmentation fault and "Address not mapped: (some-address)". The address changes every time I run it; sometimes it is just (nil). The same thing happens when I run the ring_c sample with a hostfile. I have Open MPI 3.1.2 installed on both computers, for the user associated with the job.

Hostfile contents

localhost
mpiuser@192.168.1.236

I have also tried using the hostname ubuntu-vm in the hostfile. This hostname is in my /etc/hosts file. When I type ssh mpiuser@ubuntu-vm or ssh mpiuser@192.168.1.236, it logs me in without issue and without a password prompt. I have tried reinstalling Open MPI several times, on both computers.

Is it possible that this is an Open MPI-specific issue? Could MPICH potentially work? I don't understand why this is so hard to get working. I assumed that following the standard installation instructions and running a sample program would not be problematic.

I am using Ubuntu 18.04 on both machines. The remote machine is a VM on a Windows 10 host, configured with a bridged network adapter. I put the programs into a shared folder that is accessible from both machines before I attempt to run them. In case my earlier statement wasn't clear, the sample program ring_c also fails when running on multiple machines, but not on the local machine alone.

Command line:

mpirun -np 8 --hostfile hostfile ./ring_c

Sample Error Output:

======================   ALLOCATED NODES   ======================
ubuntu-desktop: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
192.168.1.236: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
[ubuntu-desktop:11654] *** Process received signal ***
[ubuntu-desktop:11654] Signal: Segmentation fault (11)
[ubuntu-desktop:11654] Signal code: Address not mapped (1)
[ubuntu-desktop:11654] Failing at address: 0x10

Upvotes: 3

Views: 719

Answers (1)

Gilles Gouaillardet

Reputation: 8395

This is a genuine bug in Open MPI (a double free error) and it has been fixed in the master branch at https://github.com/open-mpi/ompi/pull/5863.

Meanwhile, you can manually download and apply the patch available at https://github.com/open-mpi/ompi/pull/5869
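
In case it helps, here is a rough sketch of how the patch could be fetched and applied. It assumes Open MPI 3.1.2 was built from an unpacked source tarball; the SRC_DIR location and the apply_ompi_patch helper name are hypothetical, and I have not checked that the patch applies cleanly to 3.1.2, which is why the function does a dry run first. It relies on GitHub serving a pull request as a plain patch when ".patch" is appended to its URL.

```shell
#!/bin/sh
# Sketch only: SRC_DIR and the helper name are assumptions, not from the answer.
PR_NUMBER=5869    # pull request named in the answer
PATCH_URL="https://github.com/open-mpi/ompi/pull/${PR_NUMBER}.patch"
SRC_DIR="${SRC_DIR:-$HOME/openmpi-3.1.2}"    # hypothetical path to the source tree

# Download the patch into the source tree, check that it applies (dry run),
# then apply it. The body runs in a subshell so the caller's working
# directory is left unchanged.
apply_ompi_patch() (
    cd "$SRC_DIR" &&
    curl -fsSL -o "pr-${PR_NUMBER}.patch" "$PATCH_URL" &&
    patch -p1 --dry-run < "pr-${PR_NUMBER}.patch" &&
    patch -p1 < "pr-${PR_NUMBER}.patch"
)

# Nothing runs until the function is called, for example:
# apply_ompi_patch
```

After a successful apply you would rebuild and reinstall from the same tree (typically make followed by make install, assuming configure was already run there), on both machines, so that both run the patched build.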

Note that the Open MPI users mailing list or the GitHub repo (https://github.com/open-mpi/ompi) is the best place to report this kind of issue. (mpirun should never crash, so a crash like this is very unlikely to be a programming error in your program.)

Upvotes: 3
