Reputation: 1530
I have configured two hosts with Open MPI, and I am able to run the sample code below successfully on each of them separately:
#include "mpi.h"
#include <stdio.h>
int main(argc,argv)
int argc;
char *argv[]; {
int numtasks, rank, dest, source, rc, count, tag=1;
char inmsg, outmsg='x';
MPI_Status Stat;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
dest = 1;
source = 1;
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
}
else if (rank == 1) {
dest = 0;
source = 0;
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
}
rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
printf("Task %d: Received %d char(s) from task %d with tag %d \n",
rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);
MPI_Finalize();
}
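(For reference, such a program is typically built with the Open MPI compiler wrapper; the output name below is just chosen to match the commands that follow.)
mpicc sendReceive.c -o sendReceive.o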
mpirun -np 2 sendReceive.o
works fine, but
mpirun -np 2 --host host1,host1 sendReceive.o
fails with:
[ip-172-31-71-xx:11221] [[55975,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 398
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[55975,0],0] on node ip-172-31-78-xx
Remote daemon: [[55975,0],1] on node ip-172-31-71-xx
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
I verified that I can ssh between the hosts and that everything is configured correctly, but I am not able to narrow down the problem. Any suggestions?
Answer: By mistake I had installed a different version of MPI on each system. Once I matched the versions, it works!
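(For anyone hitting the same issue: a quick way to confirm both machines run the same Open MPI build is to compare the versions they report. host2 is a placeholder here, and the ssh call assumes mpirun is on the remote non-interactive PATH.)
mpirun --version
ssh host2 mpirun --version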
Upvotes: 2
Views: 1986
Reputation: 2576
You have to allow MPI communication between the hosts in your security groups. You can fix this by first limiting MPI communication to a specific port range and then allowing that port range in your Security Group as a Custom TCP rule; an example is sketched after the file list below. To limit the port range, use openmpi-mca-params.conf. According to the Open MPI documentation:
By default, two files are searched (in order):
1. $HOME/.openmpi/mca-params.conf: The user-supplied set of values takes the highest precedence.
2. $prefix/etc/openmpi-mca-params.conf: The system-supplied set of values has a lower precedence.
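For example, a minimal sketch of such a file, assuming the btl_tcp_port_min_v4 / btl_tcp_port_range_v4 parameters for MPI point-to-point traffic and oob_tcp_dynamic_ipv4_ports for the ORTE daemons (parameter names and accepted formats can vary between Open MPI versions, so verify them with ompi_info on your installation):
# MPI point-to-point (TCP BTL) traffic: use ports 10000-10099
btl_tcp_port_min_v4 = 10000
btl_tcp_port_range_v4 = 100
# ORTE daemon (out-of-band) traffic: use ports 10100-10199
oob_tcp_dynamic_ipv4_ports = 10100-10199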
Then allow that same port range in the Security Group attached to both hosts by adding an inbound Custom TCP rule for it.
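For instance, with the AWS CLI (a sketch only: the security group ID is a placeholder, the port range matches the file above, and the 172.31.0.0/16 CIDR is an assumption based on the private addresses in the question):
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 10000-10199 --cidr 172.31.0.0/16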
Upvotes: 1