Micchaleq

Reputation: 433

Running an MPI program on a cluster

I have a problem running an MPI program on a cluster.

My host file looks like this:

10.0.9.1 slots=2
10.0.12.1 slots=2
10.0.11.1 slots=2
10.0.10.1 slots=2
10.0.6.1 slots=2
10.0.5.1 slots=2
10.0.4.1 slots=2
10.0.2.1 slots=2
10.0.1.1 slots=2

As you can see, I have 8 nodes. After launching, some processes finish their work, but others return errors:

[node02][[62903,1],7][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.4.1 failed: No route to host (113)
[node04][[62903,1],15][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.5.1 failed: No route to host (113)

I'm surprised that node02 is trying to connect to host 10.1.4.1 (I don't have this address in the host file or anywhere else). The second error is similar: node04 is trying to connect to 10.1.5.1. My addresses are 10.0.x.1, not 10.1.x.1. Why is that, and where can I find these addresses?

modprobe: ERROR: could not insert 'ip_tables': Operation not permitted
iptables v1.4.21: can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.

EDIT:

I have tested many configurations and discovered that I can run only 10 copies of the program (-np 10), on any nodes. Any larger value, for example -np 12, produces the error mentioned above.

For example, this node configuration works:

10.0.11.1 slots=1
10.0.10.1 slots=1
10.0.9.1 slots=1
10.0.6.1 slots=2
10.0.5.1 slots=1
10.0.4.1 slots=2
10.0.2.1 slots=2

Have you ever encountered such a problem?

Upvotes: 0

Views: 853

Answers (1)

Gilles Gouaillardet

Reputation: 8395

In Open MPI, the IPs in the host file are used internally to start the job. If you are not running under a supported resource manager, the plm/rsh component will use these IPs to ssh (or rsh) to the remote nodes and start the orted daemon there.

For communication, the btl/tcp component detects all available interfaces and tries to use all of them. That is likely why addresses that are not in your host file (10.1.x.1) show up in the errors.

In your case, you might have to exclude the 10.1.0.0/16 network, or restrict communication to the 10.0.0.0/16 network. That can be done on the command line:

mpirun --mca btl_tcp_if_exclude 10.1.0.0/16 ...

or

mpirun --mca btl_tcp_if_include 10.0.0.0/16 ...
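To see why the addresses from the error messages are affected by these options, here is a minimal sketch that checks which addresses fall inside 10.0.0.0/16. The helper name `in_slash16` is made up for illustration, and the check is a plain string-prefix match that only works for /16 networks; it is not how Open MPI evaluates CIDR masks internally.

```shell
#!/bin/sh
# Hypothetical helper: succeed if IPv4 address $1 begins with the
# two-octet prefix $2, i.e. it lies in the corresponding /16 network.
in_slash16() {
  case "$1" in
    "$2".*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Two addresses from the host file and the two from the error messages.
for addr in 10.0.4.1 10.0.5.1 10.1.4.1 10.1.5.1; do
  if in_slash16 "$addr" 10.0; then
    echo "$addr: inside 10.0.0.0/16"
  else
    echo "$addr: outside 10.0.0.0/16"
  fi
done
```

The two 10.1.x.1 addresses land outside the included network, so with btl_tcp_if_include 10.0.0.0/16 the interfaces carrying them should no longer be used for MPI traffic.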

Note that you might also have to restrict the oob/tcp component, which is used to wire up the job. Unlike btl/tcp, this component uses the first working IP, so that might not be needed.

mpirun --mca btl_tcp_if_include 10.0.0.0/16 --mca oob_tcp_if_include 10.0.0.0/16 ...
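If you would rather not repeat the options on every run, Open MPI also reads MCA parameters from environment variables prefixed with OMPI_MCA_. A minimal sketch, assuming the same 10.0.0.0/16 network; the host file name, process count and program name in the commented-out launch line are placeholders, not taken from the question:

```shell
#!/bin/sh
# Same settings as the --mca options, set through the environment.
export OMPI_MCA_btl_tcp_if_include=10.0.0.0/16
export OMPI_MCA_oob_tcp_if_include=10.0.0.0/16

# Show what mpirun would pick up from this shell.
echo "btl_tcp_if_include=$OMPI_MCA_btl_tcp_if_include"
echo "oob_tcp_if_include=$OMPI_MCA_oob_tcp_if_include"

# Placeholder launch line (hypothetical file and program names):
# mpirun -np 12 --hostfile hosts ./my_mpi_program
```

Options given on the mpirun command line take precedence over the environment, so individual runs can still override these values.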

Upvotes: 1
