Matheus Mendonça

Reputation: 21

MPI_Send gets stuck when running across different nodes

I have a very simple MPI program in which rank 0 sends a single character to rank 1, but the send and receive hang whenever the processes run on two or more different machines. The program works fine when all processes run on a single machine. It looks like a communication problem, but I can't figure out what it is.

Here's the code:

#include <mpi.h>

int main(int argc, char *argv[]) {
    int numtasks, rank, tag = 1;
    char inmsg, outmsg = 'x';
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0 sends one character to rank 1 (blocking send). */
        MPI_Send(&outmsg, 1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 waits for that character from rank 0 (blocking receive). */
        MPI_Recv(&inmsg, 1, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &stat);
    }

    MPI_Finalize();
    return 0;
}

Also, here are some important notes:

  1. I'm using a cluster of 2 Virtual Machines inside Google Compute Engine: mpi-test-uaiw and mpi-test-130b;
  2. I have already configured the passwordless ssh between the two VMs, that is, from mpi-test-uaiw I can just type ssh mpi-test-130b and it works fine (the opposite also works);
  3. The simple "Hello World" using MPI works with this cluster, but it does not contain any send or receive operations;
  4. Firewall is deactivated.

Any help would be appreciated. Thanks!

Answers (1)

Matheus Mendonça

Reputation: 21

I found a solution to my problem:

I was using MPICH and launching my program with mpirun. The problem, as far as I can tell, was that MPICH was using the wrong network interface. Each node has two interfaces: lo and ens4. From what I read in other posts, lo is the loopback interface, which only carries traffic from a node to itself, while ens4 is the one used to communicate with other nodes. I checked this with the following commands:

  • $ ifconfig -a: shows the available interfaces;
  • From mpi-test-uaiw:$ ping -I lo mpi-test-130b -> FAILS
  • From mpi-test-uaiw:$ ping -I ens4 mpi-test-130b -> SUCCESS
  • From mpi-test-uaiw:$ ping -I lo mpi-test-uaiw -> SUCCESS
  • From mpi-test-uaiw:$ ping -I ens4 mpi-test-uaiw -> FAILS
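The per-interface checks above can be wrapped in a small loop. This is only a sketch: the function name check_ifaces and the PING override variable are made up for illustration, and the -c and -W flags assume the Linux iputils version of ping.

```shell
# Hypothetical helper: ping one host through each listed interface.
# PING can be overridden (e.g. PING=true) to dry-run the loop without a network.
check_ifaces() {
    host=$1
    shift
    for iface in "$@"; do
        # -c 1: send a single probe; -W 2: wait at most 2 seconds for a reply;
        # -I: force the probe out through the given interface.
        if ${PING:-ping} -c 1 -W 2 -I "$iface" "$host" >/dev/null 2>&1; then
            echo "$iface -> $host: reachable"
        else
            echo "$iface -> $host: unreachable"
        fi
    done
}
```

On the cluster described here, check_ifaces mpi-test-130b lo ens4 run from mpi-test-uaiw should report lo as unreachable and ens4 as reachable, matching the manual results listed above.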

One possible solution is to run mpirun --mca btl_tcp_if_include ens4, so that mpirun uses the ens4 interface to talk to the other node. This didn't work for me, because --mca is an Open MPI option and MPICH doesn't recognize it. Therefore, I did the following:

  1. Removed the packages I had used to install MPICH (on both nodes): $ sudo apt-get remove libcr-dev mpich mpich-doc;
  2. Installed Open MPI (on both nodes): $ sudo apt install openmpi-bin openmpi-doc libopenmpi-dev.

After installing Open MPI, my code worked. Hope this helps anyone who faces the same problem!
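For reference, here is a rough sketch of how the launch command can be pinned to ens4. The openmpi branch uses the --mca btl_tcp_if_include option mentioned above. The mpich branch uses -iface, which as far as I know is the corresponding option of MPICH's Hydra launcher, but I did not test it on this cluster, so treat it as an assumption. The function name build_mpirun_cmd and the binary name ./send_recv are placeholders.

```shell
# Hypothetical helper: print a launch command pinned to one network interface.
#   $1 = MPI flavor (openmpi | mpich), $2 = interface,
#   $3 = comma-separated host list, $4 = program to run
build_mpirun_cmd() {
    flavor=$1
    iface=$2
    hosts=$3
    prog=$4
    case "$flavor" in
        openmpi)
            # Open MPI: restrict the TCP transport to the given interface.
            echo "mpirun --mca btl_tcp_if_include $iface --host $hosts -np 2 $prog"
            ;;
        mpich)
            # MPICH (Hydra launcher): untested assumption, see the note above.
            echo "mpiexec -iface $iface -hosts $hosts -n 2 $prog"
            ;;
        *)
            echo "unknown MPI flavor: $flavor" >&2
            return 1
            ;;
    esac
}
```

Calling build_mpirun_cmd openmpi ens4 mpi-test-uaiw,mpi-test-130b ./send_recv only prints the command line, which makes it easy to inspect before actually running it.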
