Reputation: 456
I am trying to run a program that I wrote on two machines using MPI. It runs perfectly fine on 4 cores on the local machine when launched with mpirun. I have already configured ssh so that the local machine can log into the remote machine without a password. Whenever I run mpirun and specify a hostfile, I get a segmentation fault with "Address not mapped: (some-address)". The address changes every time I run it; sometimes it is just (nil). The same thing happens when I run the ring_c sample with a hostfile. I have Open MPI 3.1.2 installed on both computers, for the user associated with the job.
Hostfile contents:
localhost
mpiuser@192.168.1.236
I have also tried using the hostname ubuntu-vm in the hostfile; this hostname is in my /etc/hosts file. When I type ssh mpiuser@ubuntu-vm or ssh mpiuser@192.168.1.236, it logs me in without issue and without a password prompt. I have tried reinstalling Open MPI several times on both computers.
Is it possible that this is an Open MPI-specific issue? Could MPICH potentially work? I don't understand why this is so hard to get working. I assumed that following the standard installation instructions and running a sample program would not be problematic.
I am using Ubuntu 18.04 on both machines. The remote machine is a VM on a Windows 10 host, configured with a bridged network adapter. I put the programs into a shared folder that is accessible from both machines before I attempt to run them. In case my earlier statement wasn't clear, the sample program ring_c also fails when running on multiple machines, but not when running on the local machine only.
Command line:
mpirun -np 8 --hostfile hostfile ./ring_c
Sample Error Output:
====================== ALLOCATED NODES ======================
ubuntu-desktop: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
192.168.1.236: flags=0x10 slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
[ubuntu-desktop:11654] *** Process received signal ***
[ubuntu-desktop:11654] Signal: Segmentation fault (11)
[ubuntu-desktop:11654] Signal code: Address not mapped (1)
[ubuntu-desktop:11654] Failing at address: 0x10
Upvotes: 3
Views: 719
Reputation: 8395
This is a genuine bug in Open MPI (a double free), and it has been fixed in the master branch at https://github.com/open-mpi/ompi/pull/5863.
Meanwhile, you can manually download and apply the patch available at https://github.com/open-mpi/ompi/pull/5869
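As a rough sketch of what "download and apply the patch" could look like in practice: GitHub serves a pull request as a plain patch file when `.patch` is appended to its URL, and `patch -p1` can apply it from the top of the source tree. The source directory `OMPI_SRC`, the file name `pr5869.patch`, and the helper `apply_patch` are assumptions made for illustration, and the patch is only assumed (not verified) to apply cleanly to your exact source tree, which is why the helper does a dry run first.

```shell
#!/bin/sh
# Sketch only: fetch the pull request as a patch and apply it to an
# Open MPI source tree. Names below are illustrative assumptions.
set -e

PATCH_URL="https://github.com/open-mpi/ompi/pull/5869.patch"
# ASSUMPTION: where the Open MPI sources were unpacked (hypothetical path).
OMPI_SRC="${OMPI_SRC:-$HOME/openmpi-3.1.2}"

# apply_patch SRC_DIR PATCH_FILE
# Hypothetical helper: checks that the patch applies cleanly (dry run),
# then applies it for real. PATCH_FILE must be an absolute path.
apply_patch() {
    (
        cd "$1"
        patch -p1 --dry-run < "$2" > /dev/null
        patch -p1 < "$2"
    )
}

# Only touch the network and the source tree when explicitly asked to.
if [ "${1:-}" = "--apply" ]; then
    curl -fsSL -o "$OMPI_SRC/pr5869.patch" "$PATCH_URL"
    apply_patch "$OMPI_SRC" "$OMPI_SRC/pr5869.patch"
    echo "Patch applied; rebuild and reinstall Open MPI on both machines."
fi
```

Saved under a name of your choosing and run with `--apply`, this patches the tree in place. Open MPI then has to be rebuilt and reinstalled with the same configure, make, and make install steps used for the original installation, on both machines, so that the local mpirun and the remote daemon come from the same patched build.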
Note that the Open MPI users mailing list or the GitHub repository (https://github.com/open-mpi/ompi) is the best place to report this kind of issue (mpirun should never crash, so this is very unlikely to be a programming error).
Upvotes: 3