Reputation: 83
We are running a small cluster of Intel Xeon nodes connected via InfiniBand. The login node is not attached to the InfiniBand interconnect. All nodes run Debian Jessie.
We run Slurm 14.03.9 on the login node. As the system OpenMPI is outdated and does not support the MPI-3 interface (which I require), I compiled a custom OpenMPI 2.0.1.
When I start MPI jobs by hand via
mpirun --hostfile hosts -np xx program_name
it runs fine, even across multiple nodes, and takes full advantage of InfiniBand. Good.
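For reference, the manual launch is essentially of this shape (node names, slot counts, and the program name are placeholders, not my actual setup):

```shell
# Hypothetical hostfile listing the compute nodes
$ cat hosts
node01 slots=16
node02 slots=16

# Manual launch across both nodes; -np should match the total slot count
$ mpirun --hostfile hosts -np 32 ./program_name
```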
However, when I call my MPI application from inside a Slurm runscript, it crashes with strange segfaults. I compiled OpenMPI with Slurm support, and PMI also seems to work, so I can simply write
mpirun program_name
in the Slurm runscript, and it automatically dispatches the jobs to the correct nodes with the correct number of CPU cores. However, I keep getting these segfaults.
Explicitly specifying "-np" and "--hostfile" for mpirun in the Slurm runscript does not help either. The exact same command that runs fine when started by hand leads to a segfault when started inside the Slurm environment.
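For context, the failing runscript is roughly of this shape (job name, node and task counts are placeholders):

```shell
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# With Slurm/PMI support compiled into OpenMPI, mpirun picks up
# the allocation automatically -- no -np or --hostfile needed.
mpirun ./program_name
```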
Before the segfaults occur, I get the following error message from OpenMPI:
--------------------------------------------------------------------------
Failed to create a completion queue (CQ):
Hostname: xxxx
Requested CQE: 16384
Error: Cannot allocate memory
Check the CQE attribute.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly. This may
indicate a problem on this system.
Your job will continue, but Open MPI will ignore the "ud" oob component
in this run.
Hostname: xxxx
--------------------------------------------------------------------------
I googled for it but did not find much useful information. I suspected a limit on locked memory, but executing "ulimit -l" on the compute nodes returns "unlimited", as it should.
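For the record, the value reported by an interactive shell can differ from the limit inherited by processes that slurmd spawns. A way to compare the two (a sketch, using standard srun usage) is:

```shell
# Limit seen by an interactive shell on a compute node
ssh node01 'ulimit -l'

# Limit actually inherited by a Slurm-launched process --
# this is the one that governs InfiniBand memory registration
srun --ntasks=1 bash -c 'ulimit -l'
```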
I appreciate any help to get my jobs to run with OpenMPI inside the Slurm environment.
Upvotes: 3
Views: 3754
Reputation: 83
Finally, I was able to resolve the problem.
The segfaults were indeed related to the error message posted above, which was a consequence of a "max locked memory" limit on the compute node where Slurm dispatched the job.
I struggled for a long time to lift this locked-memory limit. None of the standard procedures one finds via Google worked (neither editing /etc/security/limits.conf
nor editing /etc/init.d/slurmd
). The reason was that my Debian Jessie nodes use systemd
, which does not honor these files. I had to add the lines
[Service]
LimitMEMLOCK=32768000000
into the file /etc/systemd/system/multi-user.target.wants/slurmd.service
on all my nodes. It did not work with unlimited
, so I had to use the total system RAM in bytes instead. After modifying this file, I executed
systemctl daemon-reload
systemctl restart slurmd
on all nodes, and finally the problems vanished. Thank you, Carles Fenoy, for your valuable comments!
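A small addendum: systemd's spelling for "no limit" is infinity rather than unlimited, which may be why the unlimited setting was rejected. Also, on newer systemd versions a drop-in override survives package upgrades better than editing the unit file under multi-user.target.wants directly. Assuming systemctl edit is available (it is not on the systemd 215 shipped with Jessie), the equivalent would be:

```shell
# Creates a drop-in at /etc/systemd/system/slurmd.service.d/override.conf
systemctl edit slurmd
# ...and add:
#   [Service]
#   LimitMEMLOCK=infinity    # or a byte value, as above

# Verify the limit systemd will apply to slurmd:
systemctl show slurmd --property=LimitMEMLOCK
```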
Upvotes: 2