Reputation: 433
I have a problem running an MPI program on a cluster.
My host file looks like this:
10.0.9.1 slots=2
10.0.12.1 slots=2
10.0.11.1 slots=2
10.0.10.1 slots=2
10.0.6.1 slots=2
10.0.5.1 slots=2
10.0.4.1 slots=2
10.0.2.1 slots=2
10.0.1.1 slots=2
As you can see, I have 9 nodes. After launching the job, some processes finish their work but others return errors:
[node02][[62903,1],7][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.4.1 failed: No route to host (113)
[node04][[62903,1],15][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.5.1 failed: No route to host (113)
I'm surprised that node02 is trying to connect to the host 10.1.4.1 (I don't have this address in my host file or anywhere else). The second error is similar, meaning node04 is trying to connect to 10.1.5.1. My addresses are 10.0.x.1, not 10.1.x.1. Why is that, and where can I find where these addresses come from?
When I try to check the firewall with iptables, I also get:
modprobe: ERROR: could not insert 'ip_tables': Operation not permitted
iptables v1.4.21: can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.
EDIT:
I have tested many configurations and discovered that I can run at most 10 copies of the program (-np 10) on any combination of nodes. Any larger value, for example -np 12, produces the errors mentioned above.
For example, this node configuration works fine:
10.0.11.1 slots=1
10.0.10.1 slots=1
10.0.9.1 slots=1
10.0.6.1 slots=2
10.0.5.1 slots=1
10.0.4.1 slots=2
10.0.2.1 slots=2
Have you ever encountered such a problem?
Upvotes: 0
Views: 853
Reputation: 8395
In Open MPI, the IPs in the host file are used internally to start the job.
If you are not running under a supported resource manager, then the plm/rsh component will use these IPs to ssh (or rsh) into the remote nodes and start the orted daemon there.
For communications, the btl/tcp component will detect all available interfaces and try to use them all.
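The unexpected 10.1.x.1 addresses therefore most likely belong to a second network interface on each node. As a quick check (a sketch, assuming passwordless ssh to every node), you can list the configured IPv4 addresses across the cluster:
# Print each node's hostname and its IPv4 addresses; an interface
# carrying a 10.1.x.1 address is what btl/tcp is trying to use.
for host in 10.0.9.1 10.0.12.1 10.0.11.1 10.0.10.1 10.0.6.1 10.0.5.1 10.0.4.1 10.0.2.1 10.0.1.1; do
    ssh "$host" 'hostname; ip -4 addr show'
done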
In your case, you might have to blacklist the 10.1.0.0/16 network, or restrict Open MPI to the 10.0.0.0/16 network. That can be achieved on the command line:
mpirun --mca btl_tcp_if_exclude 10.1.0.0/16 ...
or
mpirun --mca btl_tcp_if_include 10.0.0.0/16 ...
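If you do not want to pass this on every run, the same parameter can go into an MCA parameter file. A minimal sketch, assuming a per-user file (Open MPI also reads a system-wide one under its etc/ directory):
# ~/.openmpi/mca-params.conf -- picked up automatically by mpirun
# keep MPI traffic on the 10.0.x.x cluster network only
btl_tcp_if_include = 10.0.0.0/16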
Note you might also have to restrict the oob/tcp component, which is used to wire up the job. Unlike btl/tcp, this component uses the first working IP, so that might not be needed:
mpirun --mca btl_tcp_if_include 10.0.0.0/16 --mca oob_tcp_if_include 10.0.0.0/16 ...
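As a quick sanity check (a sketch, assuming your host file is named hostfile), you can first launch a trivial non-MPI command. Note this only exercises the launch and oob wire-up, since btl/tcp connections are opened lazily when ranks first communicate:
# 9 nodes x 2 slots: expect 18 hostnames if every node is reachable
mpirun --hostfile hostfile -np 18 --mca btl_tcp_if_include 10.0.0.0/16 --mca oob_tcp_if_include 10.0.0.0/16 hostname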
Upvotes: 1