Jan van der Laan

Reputation: 8105

Simultaneously running multiple jobs on same node using slurm

Most of our jobs are either (relatively) low on CPU and high on memory (data processing) or low on memory and high on CPU (simulations). The server we have is generally big enough (256 GB of memory; 16 cores) to accommodate multiple jobs running at the same time, and we would like to use Slurm to schedule them. However, testing on a small (4 CPU) Amazon server, I am unable to get this working. As far as I know, I have to use SelectType=select/cons_res with SelectTypeParameters=CR_CPU_Memory. However, when I submit multiple jobs that each request a single CPU, they are started sequentially rather than in parallel.

My slurm.conf

ControlMachine=ip-172-31-37-52

MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none

# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory

# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

# COMPUTE NODES
NodeName=ip-172-31-37-52 CPUs=4 RealMemory=7860 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
PartitionName=test Nodes=ip-172-31-37-52 Default=YES MaxTime=INFINITE State=UP

job.sh

#!/bin/bash
sleep 30
env

Output when running jobs:

ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh 
Submitted batch job 2
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh 
Submitted batch job 3
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh 
Submitted batch job 4
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh 
Submitted batch job 5
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh 
Submitted batch job 6
ubuntu@ip-172-31-37-52:~$ sbatch -n1 -N1 job.sh 
Submitted batch job 7
ubuntu@ip-172-31-37-52:~$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3      test   job.sh   ubuntu PD       0:00      1 (Resources)
                 4      test   job.sh   ubuntu PD       0:00      1 (Priority)
                 5      test   job.sh   ubuntu PD       0:00      1 (Priority)
                 6      test   job.sh   ubuntu PD       0:00      1 (Priority)
                 7      test   job.sh   ubuntu PD       0:00      1 (Priority)
                 2      test   job.sh   ubuntu  R       0:03      1 ip-172-31-37-52

The jobs run sequentially, while in principle it should be possible to run four of them in parallel.

Upvotes: 1

Views: 3468

Answers (1)

damienfrancois

Reputation: 59072

You do not specify memory in your submission files, and you do not set a default value for memory (DefMemPerNode or DefMemPerCPU) either. In that case, Slurm allocates the node's full memory to each job, so it cannot place multiple jobs on the same node.
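
If you prefer a cluster-wide fix over per-job flags, you can set such a default in slurm.conf. A minimal sketch, where 1900 MB is an assumed value sized to roughly an even share of the test node's 7860 MB across its 4 CPUs:

# In slurm.conf: default memory allocation per allocated CPU, in MB
# (1900 is an assumption: ~7860 MB / 4 CPUs with a little headroom)
DefMemPerCPU=1900

After editing slurm.conf, apply the change with scontrol reconfigure (or restart the daemons) so the new default takes effect.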

Try specifying the memory:

sbatch -n1 -N1 --mem-per-cpu=1G job.sh
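
With 4 CPUs and 7860 MB on the test node, four 1 GB single-CPU jobs should fit side by side. A quick way to repeat the experiment from the question (a sketch, not part of the original answer):

for i in 1 2 3 4; do sbatch -n1 -N1 --mem-per-cpu=1G job.sh; done
squeue    # all four jobs should now show state R at the same time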

You can check the resources consumed on a node with scontrol show node (look for the AllocTRES value).
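
For example, with the node name from the question (the grep is just a convenience to filter the output down to the TRES lines):

scontrol show node ip-172-31-37-52 | grep -i tres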

Upvotes: 2
