celavek

Reputation: 5705

slurm jobs are pending but resources are available

I'm having some trouble with resource allocation: based on how I understood the documentation and applied it to the config file, I expect behavior that does not happen.

Here is the relevant excerpt from the config file:

SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=45,bf_resolution=90,max_array_tasks=1000
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
FastSchedule=1
...     
NodeName=cn_burebista Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
PartitionName=main_compute Nodes=cn_burebista Shared=YES Default=YES MaxTime=76:00:00 State=UP

According to the above, I have the backfill scheduler enabled with CPUs and memory configured as consumable resources. I have 56 CPUs and 256 GB of RAM in my resource pool. I would expect the backfill scheduler to allocate resources so that as many cores as possible are filled when multiple jobs ask for more resources than are available. In my case I have the following queue:

JOBID PARTITION     NAME USER ST    TIME NODES NODELIST(REASON)
 2361 main_comp training   mc PD    0:00     1 (Resources)
 2356 main_comp skrf_ori   jh  R   58:41     1 cn_burebista
 2357 main_comp skrf_ori   jh  R   44:13     1 cn_burebista

Jobs 2356 and 2357 are asking for 16 CPUs each, and job 2361 is asking for 20 CPUs, so 52 CPUs in total. As seen above, job 2361 (which was started by a different user) is marked as pending due to lack of resources, although there are plenty of CPUs and memory available. "scontrol show nodes cn_burebista" gives me the following:

NodeName=cn_burebista Arch=x86_64 CoresPerSocket=14
   CPUAlloc=32 CPUErr=0 CPUTot=56 CPULoad=21.65
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=cn_burebista NodeHostName=cn_burebista Version=16.05
   OS=Linux RealMemory=256000 AllocMem=64000 FreeMem=178166 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2018-03-09T12:04:52 SlurmdStartTime=2018-03-20T10:35:50
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I'm going through the documentation again and again, but I cannot figure out what I am doing wrong... Why do I have the above situation? What should I change in my config to make this work?

A similar (though not identical) question was asked here, but it received no answer.

EDIT:

This is part of my script for the task:

# job parameters
#SBATCH --job-name=training_carlib
#SBATCH --output=training_job_%j.out

# needed resources
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --export=ALL
...
export OMP_NUM_THREADS=20
srun ./super_awesome_app

As can be seen, the request is for 1 task per node and 20 CPUs per task. Since the scheduler is configured to treat CPUs, not cores, as the consumable resource, and I explicitly ask for CPUs in the script, why would the job ask for cores? This is my reference document.

EDIT 2:

Here's the output from the suggested command:

JobId=2383 JobName=training_carlib
   UserId=mcetateanu(1000) GroupId=mcetateanu(1001) MCS_label=N/A
   Priority=4294901726 Nice=0 Account=(null) QOS=(null)
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=3-04:00:00 TimeMin=N/A
   SubmitTime=2018-03-27T10:30:38 EligibleTime=2018-03-27T10:30:38
   StartTime=2018-03-28T10:27:36 EndTime=2018-03-31T14:27:36 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=main_compute AllocNode:Sid=zalmoxis:23690
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=cn_burebista
   NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
   TRES=cpu=20,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/train_classifier.sh
   WorkDir=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier
   StdErr=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
   StdIn=/dev/null
   StdOut=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
   Power=

Upvotes: 4

Views: 10059

Answers (1)

damienfrancois

Reputation: 59072

In your configuration, Slurm cannot allocate two jobs on two hardware threads of the same core. In your example, Slurm would thus need at least 10 cores completely free to start your job. Also, if the default block:cyclic task affinity configuration is used, Slurm cycles over sockets to distribute tasks in a node.

So I believe what is happening is the following:

  • Job 2356 submitted, being allocated 16 physical cores because of the default task distribution
  • Job 2357 submitted, being allocated 2 hardware threads on 8 physical cores, overriding default task distribution to get the job to run
  • Job 2361 submitted, waiting for at least 10 physical cores to become available (see the breakdown below).
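Counting cores rather than CPUs, the numbers from the question then work out roughly like this (a sketch based on the node definition above):

Node: 2 sockets x 14 cores x 2 threads = 28 cores / 56 CPUs
Job 2356: 16 CPUs -> 16 full cores     (one thread per core, default distribution)
Job 2357: 16 CPUs ->  8 full cores     (both hardware threads of each core used)
Cores in use: 16 + 8 = 24; cores free: 28 - 24 = 4
Job 2361: 20 CPUs -> needs 10 full cores, but only 4 are free -> Reason=Resources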

You can get the exact CPU numbers allocated to a job using

scontrol show -dd job <jobid>
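For the jobs in the question, that would be, for example:

scontrol show -dd job 2356
scontrol show -dd job 2357

The detailed (-dd) output includes a per-node CPU_IDs= field listing the exact hardware threads each job holds, which makes the core/thread packing described above visible.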

To configure Slurm so that it considers hardware threads exactly as if they were cores, you indeed need to define

SelectTypeParameters=CR_CPU_Memory 

but you also need to specify CPUs directly in the node definition

NodeName=cn_burebista CPUs=56 RealMemory=256000 State=UNKNOWN

and not let Slurm compute CPUs from Sockets, CoresPerSocket, and ThreadsPerCore.
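Putting both pieces together, the relevant lines of the question's slurm.conf would then read as follows (a sketch; only the node line changes, the partition line stays as it was):

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
NodeName=cn_burebista CPUs=56 RealMemory=256000 State=UNKNOWN
PartitionName=main_compute Nodes=cn_burebista Shared=YES Default=YES MaxTime=76:00:00 State=UP

With CPUs=56 and no Sockets/CoresPerSocket/ThreadsPerCore specification, Slurm treats each hardware thread as an independently schedulable CPU, which is what would allow job 2361 to start on the threads left free by the other two jobs.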

See the discussion of ThreadsPerCore in the node definition section of the slurm.conf manpage.

Upvotes: 1
