Ana Khorguani

Reputation: 926

Issues when running MPI program on two cluster nodes

I have a very simple MPI program:

  #include <mpi.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
    int my_rank;
    int my_new_rank;
    int size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* a few ranks spread across the job print the host they run on */
    if (my_rank == 0 || my_rank == 18 || my_rank == 36){
      char hostbuffer[256];
      gethostname(hostbuffer, sizeof(hostbuffer));
      printf("Hostname: %s\n", hostbuffer);
    }

    MPI_Finalize();
    return 0;
  }

I am running it on a cluster with two nodes. I have a Makefile that compiles it with mpicc into an executable called cannon.run, and I run it with the following command:

time mpirun --mca btl ^openib -n 64 -hostfile ../second_machinefile ./cannon.run
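
second_machinefile simply contains the names of the two nodes, something like this (the hostnames below are just placeholders):

machine_1
machine_2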

The weird problem is that when I run this command from one node, it executes normally; however, when I run the same command from the other node, I get this error:

Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x30

After trying to run it under GDB, I got this backtrace:

#0  0x00007ffff646e936 in ?? ()
   from /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so
#1  0x00007ffff6449733 in pmix_common_dstor_init ()
   from /lib/x86_64-linux-gnu/libmca_common_dstore.so.1
#2  0x00007ffff646e5b4 in ?? ()
   from /usr/lib/x86_64-linux-gnu/pmix/lib/pmix/mca_gds_ds21.so
#3  0x00007ffff659e46e in pmix_gds_base_select ()
   from /lib/x86_64-linux-gnu/libpmix.so.2
#4  0x00007ffff655688d in pmix_rte_init ()
   from /lib/x86_64-linux-gnu/libpmix.so.2
#5  0x00007ffff6512d7c in PMIx_Init () from /lib/x86_64-linux-gnu/libpmix.so.2
#6  0x00007ffff660afe4 in ext2x_client_init ()
   from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pmix_ext2x.so
#7  0x00007ffff72e1656 in ?? ()
   from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_pmi.so
#8  0x00007ffff7a9d11a in orte_init ()
   from /lib/x86_64-linux-gnu/libopen-rte.so.40
#9  0x00007ffff7d6de62 in ompi_mpi_init ()
   from /lib/x86_64-linux-gnu/libmpi.so.40
#10 0x00007ffff7d9c17e in PMPI_Init () from /lib/x86_64-linux-gnu/libmpi.so.40
#11 0x00005555555551d6 in main ()

which to be honest I don't fully understand.

My main confusion is that the program executes properly from machine_1: it connects to machine_2 without errors and processes are initialized on both machines. But when I try to execute the same command from machine_2, it is not able to connect to machine_1. The program also runs correctly on machine_2 alone when I decrease the number of processes so that they fit on one machine.

Is there anything I am doing wrong? Or what could I try in order to better understand the cause of the problem?

Upvotes: 1

Views: 951

Answers (1)

Gilles Gouaillardet

Reputation: 8395

This is indeed a bug in Open PMIx that is addressed at https://github.com/openpmix/openpmix/pull/1580

Meanwhile, a workaround is to blacklist the gds/ds21 component:

  • One option is to set

export PMIX_MCA_gds=^ds21

in the environment before invoking mpirun (see the full command after this list).

  • Another option is to add the following line
gds = ^ds21

to the PMIx config file located in <pmix_prefix>/etc/pmix-mca-params.conf
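
For example, combining the first workaround with the command line from the question, the invocation would look roughly like this:

export PMIX_MCA_gds=^ds21
time mpirun --mca btl ^openib -n 64 -hostfile ../second_machinefile ./cannon.run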

Upvotes: 2
