Reputation: 5705
I'm having some trouble with resource allocation: based on my understanding of the documentation, and how I applied it in the config file, I expect behavior that does not happen.
Here is the relevant excerpt from the config file:
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=45,bf_resolution=90,max_array_tasks=1000
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
FastSchedule=1
...
NodeName=cn_burebista Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
PartitionName=main_compute Nodes=cn_burebista Shared=YES Default=YES MaxTime=76:00:00 State=UP
According to the above, I have the backfill scheduler enabled, with CPUs and memory configured as consumable resources. I have 56 CPUs and 256 GB of RAM in my resource pool. I would expect the backfill scheduler to pack the cores as fully as possible when multiple jobs request more resources than are available. In my case I have the following queue:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2361 main_comp training mc PD 0:00 1 (Resources)
2356 main_comp skrf_ori jh R 58:41 1 cn_burebista
2357 main_comp skrf_ori jh R 44:13 1 cn_burebista
Jobs 2356 and 2357 are asking for 16 CPUs each, and job 2361 is asking for 20 CPUs, so 52 CPUs in total. As seen above, job 2361 (which was started by a different user) is marked as pending due to lack of resources, although there are plenty of CPUs and plenty of memory available. "scontrol show nodes cn_burebista" gives me the following:
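My expectation boils down to simple CPU accounting (a sketch with the job sizes from the queue above, not Slurm code): if each hardware thread were handed out individually, all three jobs would fit on the node.

```python
# Naive CPU accounting for the queue above.
cpu_total = 56                      # 2 sockets * 14 cores * 2 threads
requests = {2356: 16, 2357: 16, 2361: 20}

cpu_requested = sum(requests.values())
print(cpu_requested)                # 52
print(cpu_requested <= cpu_total)   # True: naively, job 2361 should start
```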
NodeName=cn_burebista Arch=x86_64 CoresPerSocket=14
CPUAlloc=32 CPUErr=0 CPUTot=56 CPULoad=21.65
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=cn_burebista NodeHostName=cn_burebista Version=16.05
OS=Linux RealMemory=256000 AllocMem=64000 FreeMem=178166 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
BootTime=2018-03-09T12:04:52 SlurmdStartTime=2018-03-20T10:35:50
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
I'm going through the documentation again and again but I cannot figure out what I am doing wrong. Why do I have the above situation? What should I change in my config to make this work?
A similar (but not identical) question was asked here, but received no answer.
EDIT:
This is part of my script for the task:
# job parameters
#SBATCH --job-name=training_carlib
#SBATCH --output=training_job_%j.out

# needed resources
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --export=ALL
...
export OMP_NUM_THREADS=20
srun ./super_awesome_app
As can be seen, the request is for 1 task per node and 20 CPUs per task. Since the scheduler is configured to treat CPUs, not cores, as the consumable resource, and I explicitly ask for CPUs in the script, why would the job wait on cores? This is my reference document.
EDIT 2:
Here's the output from the suggested command:
JobId=2383 JobName=training_carlib
UserId=mcetateanu(1000) GroupId=mcetateanu(1001) MCS_label=N/A
Priority=4294901726 Nice=0 Account=(null) QOS=(null)
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=3-04:00:00 TimeMin=N/A
SubmitTime=2018-03-27T10:30:38 EligibleTime=2018-03-27T10:30:38
StartTime=2018-03-28T10:27:36 EndTime=2018-03-31T14:27:36 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=main_compute AllocNode:Sid=zalmoxis:23690
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=cn_burebista
NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/train_classifier.sh
WorkDir=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier
StdErr=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
StdIn=/dev/null
StdOut=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
Power=
Upvotes: 4
Views: 10059
Reputation: 59072
In your configuration, Slurm cannot allocate two jobs on the two hardware threads of the same core. In your example, Slurm would thus need at least 10 cores completely free to start your job.
Also, if the default block:cyclic task affinity configuration is used, Slurm cycles over sockets to distribute tasks within a node.
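In other words, because a core's two hardware threads cannot be split between jobs, a CPU request is effectively rounded up to whole cores. A sketch of that accounting, using the numbers from the question (illustrative arithmetic, not Slurm code):

```python
import math

threads_per_core = 2
cores_total = 2 * 14          # 28 physical cores on cn_burebista

def whole_cores(cpus):
    # No two jobs may occupy the two hardware threads of one core,
    # so a CPU request is rounded up to whole cores.
    return math.ceil(cpus / threads_per_core)

print(whole_cores(20))        # 10: cores that must be completely free for job 2361

# If the default cyclic distribution spreads each running job one
# thread per core, the two 16-CPU jobs can touch up to 16 cores each:
occupied = min(16 + 16, cores_total)
print(cores_total - occupied) # 0 cores left completely free -> job 2361 pends
```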
So what is happening, I believe, is the following: jobs 2356 and 2357 were each allocated 16 CPUs spread across cores on both sockets by the default task distribution; because no other job may use the second hardware thread of an already-allocated core, fewer than the 10 completely free cores that job 2361 needs remain, so it stays pending.
You can get the exact CPU numbers allocated to a job using:
scontrol show -dd job <jobid>
To configure Slurm so that it treats hardware threads exactly as if they were cores, you indeed need to define
SelectTypeParameters=CR_CPU_Memory
but you also need to specify CPUs directly in the node definition:
NodeName=cn_burebista CPUs=56 RealMemory=256000 State=UNKNOWN
rather than letting Slurm compute CPUs from Sockets, CoresPerSocket, and ThreadsPerCore.
See the description of ThreadsPerCore in the node definition section of the slurm.conf manpage.
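Under that flat node definition the accounting changes: Slurm schedules the 56 CPUs as independent units, with no rounding to whole cores. A sketch of the same queue under that assumption (illustrative arithmetic, not Slurm code):

```python
# With NodeName=... CPUs=56 and CR_CPU_Memory, each hardware thread is
# an independent schedulable CPU; no rounding to whole cores occurs.
cpu_total = 56
allocated = 16 + 16            # jobs 2356 and 2357
free = cpu_total - allocated

print(free)                    # 24
print(free >= 20)              # True: job 2361 can now start
```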
Upvotes: 1