I am new to Slurm. I am running a job with srun that uses two nodes, and it runs into a problem. When I run the same job on only one node (either of them), the task completes. I will first show the main error message and then my setup.
1. Error message
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.
The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: NodeA
Local PID: 28203
Peer hostname: NodeB ([[62771,0],8])
Source IP of socket: <socket IP>
Known IPs of peer: <a list of 4 IP6 addresses>
--------------------------------------------------------------------------
2. The setup
The job is launched like this:
srun --mpi=pmi2 -p <a partition with the nodes> --job-name <some name> \
--nodelist=<HERE THERE IS A DIFFERENCE see below> \
-n${gpus} --gres=gpu:8 \
--ntasks-per-node=8 \
python -u main.py
The only difference between the two runs is the node list. The job that runs successfully uses
nodelist=nodeA
The job that gets stuck and throws the above error uses
nodelist=nodeA,nodeB
Both runs also show the following errors:
libibverbs: Warning: couldn't load driver 'libocrdma-rdmav34.so': libocrdma-rdmav34.so: cannot open shared object file: No such file or directory
[NodeA][[62771,0],0][btl_openib_component.c:1648:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: NodeA
Local device: mlx5_0
--------------------------------------------------------------------------
but this doesn't seem to affect the outcome, since the single-node job finishes successfully.
I am new to this world. Can someone tell me what could be going wrong and how to solve it? From what I have found so far, it might be related to the nodes having multiple IP interfaces in the same subnet, but I don't understand what that implies or how it can be fixed.
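To make the "multiple IP interfaces in the same subnet" idea concrete, here is a minimal sketch of how it could be checked on each node. It is not a fix. The function name `interfaces_sharing_subnet` and the interface names and addresses in the example are hypothetical; the real pairs would have to be collected on NodeA and NodeB (for example from the output of `ip addr`).

```python
import ipaddress
from collections import defaultdict


def interfaces_sharing_subnet(addresses):
    """Group interfaces by subnet and return only subnets used by more than one.

    `addresses` is a list of (interface_name, "address/prefix") pairs,
    e.g. ("eth0", "10.0.0.5/24"). Works for IPv4 and IPv6 strings.
    """
    groups = defaultdict(list)
    for name, cidr in addresses:
        # ip_interface keeps the host address and knows its enclosing network
        network = ipaddress.ip_interface(cidr).network
        groups[network].append(name)
    # Keep only subnets that more than one interface belongs to
    return {str(net): names for net, names in groups.items() if len(names) > 1}


# Hypothetical addresses, NOT taken from the actual nodes
example = [
    ("eth0", "10.0.0.5/24"),
    ("eth1", "10.0.0.6/24"),
    ("ib0", "192.168.1.2/24"),
]
print(interfaces_sharing_subnet(example))
```

If two interfaces on a node do end up in the same subnet, Open MPI's `btl_tcp_if_include` MCA parameter is the setting usually suggested for restricting which interface the TCP transport uses. Whether that resolves this particular error is an assumption that would still need to be tested.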