Reputation: 3577
I have a cluster of 3 nodes with 110 GB of RAM and 16 cores each. I want to keep submitting jobs to the nodes as long as the memory they request is available.
I am using this bash script, called test_slurm.sh:
#!/bin/sh
#SBATCH --nodes=1          # run each job on a single node
#SBATCH --ntasks=1         # one task per job
#SBATCH --cpus-per-task=1  # one CPU for that task
#SBATCH --mem=10G          # request 10 GB of memory per node
python test.py
So if I have 33 jobs of 10 GB each and 3 nodes with 110 GB of RAM each, I want to be able to run all 33 at once if possible, instead of only 3 at once, which is what happens with my current setup.
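For reference, a minimal sketch of one way the 33 jobs could be submitted, assuming the script above is saved as test_slurm.sh (a plain loop; a job array would be equivalent here):

# Submit 33 independent copies of the 10 GB job
for i in $(seq 1 33); do
    sbatch test_slurm.sh
done

# Or, equivalently, submit a single job array with 33 tasks:
# sbatch --array=1-33 test_slurm.sh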
My squeue output shows only three jobs running at once, even though I have plenty of memory for more.
Running sinfo -o "%all" returns:
AVAIL|CPUS|TMP_DISK|FEATURES|GROUPS|SHARE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIORITY|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |SOCKETS |CORES |THREADS
up|16|0|(null)|all|NO|infinite|115328|parrot101|parrot101|1|no|1-infinite|alloc|Unknown|14.03|1|16:1:1|1/0 |UNLIMITED |16/0/0/16 |1 |none |1/0/0/1 |(null) |Unknown |n/a |OFF |parrot101 |0.01 |myNodes* |myNodes |all |allocated |Unknown |16 |1 |1
up|16|0|(null)|all|NO|infinite|115328|parrot102|parrot102|1|no|1-infinite|alloc|Unknown|14.03|1|16:1:1|1/0 |UNLIMITED |16/0/0/16 |1 |none |1/0/0/1 |(null) |Unknown |n/a |OFF |parrot102 |0.14 |myNodes* |myNodes |all |allocated |Unknown |16 |1 |1
up|16|0|(null)|all|NO|infinite|115328|parrot103|parrot103|1|no|1-infinite|alloc|Unknown|14.03|1|16:1:1|1/0 |UNLIMITED |16/0/0/16 |1 |none |1/0/0/1 |(null) |Unknown |n/a |OFF |parrot103 |0.26 |myNodes* |myNodes |all |allocated |Unknown |16 |1 |1
The output of squeue -o "%all" is:
ACCOUNT|GRES|MIN_CPUS|MIN_TMP_DISK|END_TIME|FEATURES|GROUP|SHARED|JOBID|NAME|COMMENT|TIMELIMIT|MIN_MEMORY|REQ_NODES|COMMAND|PRIORITY|QOS|REASON||ST|USER|RESERVATION|WCKEY|EXC_NODES|NICE|S:C:T|JOBID |EXEC_HOST |CPUS |NODES |DEPENDENCY |ARRAY_JOB_ID |GROUP |SOCKETS_PER_NODE |CORES_PER_SOCKET |THREADS_PER_CORE |ARRAY_TASK_ID |TIME_LEFT |TIME |NODELIST |CONTIGUOUS |PARTITION |PRIORITY |NODELIST(REASON) |START_TIME |STATE |USER |SUBMIT_TIME |LICENSES |CORE_SPECWORK_DIR
(null)|(null)|1|0|N/A|(null)|j1101|no|26609|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 26|0.99998411652632|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26609 |n/a |1 |1 | |26609 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899076 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
(null)|(null)|1|0|N/A|(null)|j1101|no|26610|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 27|0.99998411629349|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26610 |n/a |1 |1 | |26610 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899075 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
(null)|(null)|1|0|N/A|(null)|j1101|no|26611|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 28|0.99998411606066|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26611 |n/a |1 |1 | |26611 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899074 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
(null)|(null)|1|0|N/A|(null)|j1101|no|26612|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 29|0.99998411582782|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26612 |n/a |1 |1 | |26612 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899073 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
(null)|(null)|1|0|N/A|(null)|j1101|no|26613|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 30|0.99998411559499|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26613 |n/a |1 |1 | |26613 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899072 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
Upvotes: 4
Views: 3854
Reputation: 5965
Based on your output of sinfo -o "%all", I can answer why your jobs are not starting.
If you look under the field CPUS(A/I/O/T), the output is 16/0/0/16 for all nodes:
Allocated: 16
Idle (available for jobs): 0
Other: 0
Total: 16
In other words, the CPUs are the reason the jobs are not starting, not the memory as you expected. All CPUs seem to be allocated by (other) jobs.
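A quick way to see where those CPUs went is to query the nodes directly; a minimal sketch, assuming the node names from your sinfo output:

# Per-node summary: hostname, CPUs as allocated/idle/other/total, configured memory, free memory
sinfo -N -o "%n %C %m %e"

# Full detail for a single node, including how many CPUs and how much memory are allocated
scontrol show node parrot101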
Now as to why... For this we currently have insufficient information; the output of squeue -o "%all" would give more insight.
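A narrower format than %all can also help, showing just each job's state, requested CPUs, requested memory, and pending reason (format letters as documented for squeue; adjust field widths as needed):

# Job ID, state, requested CPUs, minimum memory, and reason/nodelist per job
squeue -o "%.10i %.2t %.4C %.10m %.20R"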
Upvotes: 3