Reputation: 91
We use the Slurm resource manager to submit jobs to the cluster. Recently, we upgraded Slurm from version 15 to 18.
Since the upgrade I have encountered the following problem:
I routinely submit jobs that each require a single core and should use ~100% CPU.
However, when several of these jobs land on the same compute node, they seem to roughly share a single core. I.e., when the 1st job arrives it gets 100% CPU, when the 2nd arrives they each get 50%, and so on. Sometimes there are 20 jobs on the same node (which has 24 physical cores) and each gets ~5% CPU.
The setup that reproduces the problem is very simple:
The executable is a simple C busy loop that was verified to consume ~100% CPU when run locally.
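For completeness, a minimal sketch of such a busy loop (illustrative only; the actual source isn't shown here):

/* busy_loop.c - spins forever, pegging one core at ~100% CPU */
int main(void)
{
    volatile unsigned long counter = 0;  /* volatile keeps the loop from being optimized away */
    for (;;)
        counter++;
    return 0;  /* never reached */
}

Compiled with e.g. gcc -O2 -o busy_loop busy_loop.c and checked locally (e.g. with top) to confirm it uses ~100% of one core.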
The script file that I send is:
> cat my.sh
#!/bin/bash
/path/to/busy_loop
The sbatch command is:
sbatch -n1 -c1 my.sh
Some observations:
If instead I run:
sbatch -n2 -c1 my.sh
and inside the script file use mpirun /path/to/busy_loop, every process seems to get 100% CPU. However, if another such job is sent to the same node, the two jobs share the same 2 cores and each of the 4 processes gets 50% CPU.
I didn't find any reference to a similar problem on the web, and any pointer or help would be very much appreciated.
Upvotes: 0
Views: 390
Reputation: 91
After trying different changes in slurm.conf, the change that solved the problem was adding the line:
TaskPlugin=task/affinity
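For context, the relevant portion of slurm.conf then looks roughly like the excerpt below; the SelectType lines are just a typical consumable-resources setup and are not necessarily this cluster's exact settings:

# slurm.conf excerpt - bind each task to the cores it was allocated
TaskPlugin=task/affinity
# typical consumable-resources scheduling (illustrative)
SelectType=select/cons_res
SelectTypeParameters=CR_Core

Note that a plugin change like this generally requires restarting slurmctld and the slurmd daemons before it takes effect.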
Upvotes: 0