Stefano Potter

Reputation: 3577

Run multiple jobs at a time per node through SLURM

I have a cluster with 3 nodes; each node has 110GB of RAM and 16 cores. I want to keep submitting jobs to the nodes as long as the memory they request is available.

I am using this bash script called test_slurm.sh:

    #!/bin/sh
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=10G
    python test.py
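
The question does not show how the 33 jobs are submitted; a minimal sketch of the usual pattern, assuming each job is independent and the script takes no arguments, would be:

    # Sketch only: submit the same 1-CPU / 10G job script 33 times.
    for i in $(seq 1 33); do
        sbatch test_slurm.sh
    done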

So if I have 33 jobs that need 10GB each, and 3 nodes with 110GB of RAM each, I want to be able to run all 33 at once (11 per node) if possible, instead of only 3 at once, which is what my current setup does.
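
One way to express the same workload, as a sketch (assuming the partition allows several jobs to share a node), is a job array in which every task keeps the same 1-CPU / 10G request, so the scheduler is free to pack up to 11 tasks onto each node:

    #!/bin/sh
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=10G
    #SBATCH --array=1-33
    # Each array task gets its own 1-CPU / 10G allocation, so SLURM can
    # place several tasks on the same node when cores and memory are idle.
    python test.py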

This is what my squeue output looks like: [screenshot of squeue output omitted]

So only three jobs run at once even though I have plenty of memory for more.

sinfo -o "%all" returns:

    AVAIL|CPUS|TMP_DISK|FEATURES|GROUPS|SHARE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIORITY|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |SOCKETS |CORES |THREADS
    up|16|0|(null)|all|NO|infinite|115328|parrot101|parrot101|1|no|1-infinite|alloc|Unknown|14.03|1|16:1:1|1/0 |UNLIMITED |16/0/0/16 |1 |none |1/0/0/1 |(null) |Unknown |n/a |OFF |parrot101 |0.01 |myNodes* |myNodes |all |allocated |Unknown |16 |1 |1
    up|16|0|(null)|all|NO|infinite|115328|parrot102|parrot102|1|no|1-infinite|alloc|Unknown|14.03|1|16:1:1|1/0 |UNLIMITED |16/0/0/16 |1 |none |1/0/0/1 |(null) |Unknown |n/a |OFF |parrot102 |0.14 |myNodes* |myNodes |all |allocated |Unknown |16 |1 |1
    up|16|0|(null)|all|NO|infinite|115328|parrot103|parrot103|1|no|1-infinite|alloc|Unknown|14.03|1|16:1:1|1/0 |UNLIMITED |16/0/0/16 |1 |none |1/0/0/1 |(null) |Unknown |n/a |OFF |parrot103 |0.26 |myNodes* |myNodes |all |allocated |Unknown |16 |1 |1

The output of squeue -o "%all" returns:

    ACCOUNT|GRES|MIN_CPUS|MIN_TMP_DISK|END_TIME|FEATURES|GROUP|SHARED|JOBID|NAME|COMMENT|TIMELIMIT|MIN_MEMORY|REQ_NODES|COMMAND|PRIORITY|QOS|REASON||ST|USER|RESERVATION|WCKEY|EXC_NODES|NICE|S:C:T|JOBID |EXEC_HOST |CPUS |NODES |DEPENDENCY |ARRAY_JOB_ID |GROUP |SOCKETS_PER_NODE |CORES_PER_SOCKET |THREADS_PER_CORE |ARRAY_TASK_ID |TIME_LEFT |TIME |NODELIST |CONTIGUOUS |PARTITION |PRIORITY |NODELIST(REASON) |START_TIME |STATE |USER |SUBMIT_TIME |LICENSES |CORE_SPECWORK_DIR
    (null)|(null)|1|0|N/A|(null)|j1101|no|26609|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 26|0.99998411652632|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26609 |n/a |1 |1 | |26609 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899076 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
    (null)|(null)|1|0|N/A|(null)|j1101|no|26610|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 27|0.99998411629349|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26610 |n/a |1 |1 | |26610 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899075 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
    (null)|(null)|1|0|N/A|(null)|j1101|no|26611|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 28|0.99998411606066|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26611 |n/a |1 |1 | |26611 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899074 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
    (null)|(null)|1|0|N/A|(null)|j1101|no|26612|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 29|0.99998411582782|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26612 |n/a |1 |1 | |26612 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899073 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
    (null)|(null)|1|0|N/A|(null)|j1101|no|26613|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 30|0.99998411559499|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26613 |n/a |1 |1 | |26613 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899072 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python

Upvotes: 4

Views: 3854

Answers (1)

Tom de Geus

Reputation: 5965

Based on your output of sinfo -o "%all" I can answer why your jobs are not starting.

If you look under the field CPUS(A/I/O/T), the output is 16/0/0/16 for all nodes:

  • Allocated: 16
  • Idle (available for jobs): 0
  • Other: 0
  • Total: 16

In other words, the CPUs, not the memory as you expected, are the reason the jobs are not starting: all CPUs seem to be allocated by (other) jobs.
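
A narrower sinfo query makes this easier to read per node; as a sketch (the field selection here is just one possible choice):

    # Per node: hostname, CPUs as Allocated/Idle/Other/Total,
    # configured memory (MB) and free memory (MB).
    sinfo -N -o "%n %C %m %e"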

Now as to why: for this there is currently insufficient information. The output of squeue -o "%all" would give more insight.
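
For example, a more targeted squeue format than -o "%all" would already show, per job, the requested CPUs and memory together with the reason a pending job is waiting; a sketch (the field selection is arbitrary):

    # Per job: job id, state, requested CPUs, requested memory,
    # and the reason/nodelist column for pending jobs.
    squeue -o "%.10i %.2t %.4C %.8m %R"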

Upvotes: 3
