Reputation: 33
I set up a Beowulf cluster of 3 VMs, with MPICH and Boost installed on each machine. My programs run fine on the cluster, but as soon as I call split on a boost::mpi::communicator, execution blocks indefinitely.
Take the following code:
#include <boost/mpi.hpp>
#include <iostream>

namespace mpi = boost::mpi;

int main(int argc, char* argv[])
{
    mpi::environment env(argc, argv);
    mpi::communicator world;

    // Ranks with the same group_id end up in the same sub-communicator.
    int group_id = world.rank() % 3;
    mpi::communicator local = world.split(group_id);

    std::cout << "I am process " << world.rank() << " of " << world.size() << "." << std::endl;
    std::cout << "I am sub-process " << local.rank() << " of " << local.size() << "." << std::endl;
    return 0;
}
When executed across the cluster, nothing happens. But if I run it on a single node (let's say with -np 9), it works just fine:
I am process 5 of 9.
I am process 2 of 9.
I am process 3 of 9.
I am process 1 of 9.
I am process 6 of 9.
I am process 7 of 9.
I am process 0 of 9.
I am process 4 of 9.
I am sub-process 2 of 3.
I am sub-process 0 of 3.
I am sub-process 1 of 3.
I am sub-process 2 of 3.
I am sub-process 1 of 3.
I am sub-process 1 of 3.
I am sub-process 0 of 3.
I am process 8 of 9.
I am sub-process 2 of 3.
I am sub-process 0 of 3.
Removing the split call makes the example run as intended across the 3 nodes, so the call to split is clearly the culprit here.
Any idea what I'm doing wrong with split?
Upvotes: 1
Views: 76
Reputation: 33
I finally found the problem: mpirun was sometimes trying to use the wrong network interface for communication. Specifying the correct interface when running mpirun fixes it!
Here is the parameter to pass to mpirun (an Open MPI MCA option):
--mca btl_tcp_if_include [your_network_interface]
Upvotes: 1