KansaiRobot

Open MPI problem under Slurm when running a job on two nodes

I am new to Slurm. I am running a job with srun that uses two nodes, and it runs into a problem. When I run the same job on only one node (either of them), it completes. I will first show the main error message, and then my setup.

1. Error message

--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          NodeA
  Local PID:           28203
  Peer hostname:       NodeB ([[62771,0],8])
  Source IP of socket: <socket IP>
  Known IPs of peer:   <a list of 4 IP6 addresses>
--------------------------------------------------------------------------

2. The setup

The job is launched like this:

srun --mpi=pmi2 -p <a partition with the nodes> --job-name <some name> \
   --nodelist=<HERE THERE IS A DIFFERENCE see below> \
   -n ${gpus} --gres=gpu:8 \
   --ntasks-per-node=8 \
   python -u main.py

There are only two differences between the runs. The job that runs successfully:

  1. gpus=8
  2. the node list has only one node nodelist=nodeA

The job that gets stuck and prints the above error:

  1. gpus=16
  2. The node list has two nodes nodelist=nodeA,nodeB

Both jobs also show the following errors:

libibverbs: Warning: couldn't load driver 'libocrdma-rdmav34.so': libocrdma-rdmav34.so: cannot open shared object file: No such file or directory
[NodeA][[62771,0],0][btl_openib_component.c:1648:init_one_device] error obtaining device attributes for mlx5_0 errno says No space left on device
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   NodeA
  Local device: mlx5_0
--------------------------------------------------------------------------

but these don't seem to affect the outcome, since the single-node job prints them too and still finishes successfully.

I am new to this world. Can someone point out what could be going wrong and how to solve it? I have searched, and it seems it might be related to multiple IP interfaces in the same subnet, but I don't know what this implies or how it can be solved.
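For reference, the workaround usually mentioned for the "multiple IP interfaces" case is to pin Open MPI's TCP transport to a single interface through the `btl_tcp_if_include` MCA parameter, which can be set as an `OMPI_MCA_`-prefixed environment variable. The following is only a minimal sketch under assumptions: the interface name `eth0` is hypothetical and would have to be replaced by an interface that both NodeA and NodeB actually share, and it is not confirmed that this resolves the error above.

```shell
#!/bin/sh
# Hypothetical interface name (an assumption, not taken from the setup above).
# Replace it with the interface that connects NodeA and NodeB.
IFACE="eth0"

# Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables.
# btl_tcp_if_include limits the TCP BTL to the listed interface(s).
export OMPI_MCA_btl_tcp_if_include="$IFACE"

# Print the setting so it can be checked before launching the job.
echo "btl_tcp_if_include=${OMPI_MCA_btl_tcp_if_include}"
```

By default srun passes the caller's environment on to the launched tasks, so the same srun command as above could then be run from this shell without changes.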
