Reputation: 2250
At my workplace we have a cluster with 33 physical nodes. Each node hosts 2 virtual machines (VMs), each with 10 CPUs (4 slots) and 112GB of memory.
I am submitting jobs to this cluster, and below is the maximum memory that the jobs required (obtained with qacct -j [job]):
maxvmem 37.893GB
maxvmem 37.660GB
maxvmem 37.980GB
maxvmem 41.059GB
maxvmem 41.615GB
maxvmem 38.744GB
maxvmem 38.615GB
Let's consider the maximum required memory to be 42GB for the rest of this question.
In fact, when I submitted 92 jobs to this cluster (without specifying any qsub parameter), I noticed that some of them crashed, apparently due to memory issues. All the jobs that crashed were running on physical nodes that held four jobs. This makes sense: with four jobs of up to 42GB each on one physical node, 4*42 = 168GB > 112GB, so I am not surprised that some jobs crashed.
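The crash arithmetic can be sanity-checked in the shell (numbers as quoted above):

```shell
# Four ~42 GB jobs on one node versus the 112 GB available
jobs=4; job_peak_gb=42; node_mem_gb=112
demand=$(( jobs * job_peak_gb ))
echo "demand: ${demand}G, available: ${node_mem_gb}G"   # demand: 168G, available: 112G
[ "$demand" -gt "$node_mem_gb" ] && echo "oversubscribed"
```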
I then decided to limit the memory per job. According to this link, this can be done via the -l h_vmem=[maxmem] qsub parameter, which I added to the shell script submitted to the queue (below are the first three lines of the .sh script; the second line is the one that should limit the memory). Note that -l h_vmem is the memory per slot:
#! /bin/bash
#$ -l h_vmem=28G
echo HOSTNAME: `hostname`
After submitting the 92 jobs, if I run qstat -j [job], I see a line such as:
hard resource_list: h_vmem=28G
This means 28*4 = 112GB per physical node, which is my limit. That looks OK.
However, I see that some physical nodes already have 4 jobs running on them, which is what I wanted to avoid. Given that each job can take up to 42GB of memory, I would expect a maximum of 2 jobs per physical node (maximum memory required would be 2*42 = 84GB), so that they would not crash due to lack of memory.
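For comparison, here is the per-slot budget the request sets up versus the packing I actually want, as a quick shell check (values from above):

```shell
# The h_vmem request is counted per slot; 4 slots per VM gives the scheduler's budget
h_vmem_gb=28; slots=4
echo "scheduler budget per node: $(( h_vmem_gb * slots ))G"   # 112G
# What I want: no more than floor(112 / 42) = 2 jobs sharing a node
node_mem_gb=112; job_peak_gb=42
echo "safe jobs per node: $(( node_mem_gb / job_peak_gb ))"   # 2
```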
So it seems that qsub is not interpreting the parameter #$ -l h_vmem=28G in my .sh script as I expected: the required memory can go up to 42x4 = 168GB, whereas 28x4 = 112GB should be my limit.
Am I using the wrong qsub parameter (-l h_vmem)? The wrong notation in my .sh script (#$ -l h_vmem=28G; probably not, as qstat shows it was parsed correctly)? Or something else?
Upvotes: 0
Views: 1604
Reputation: 90
The option -l m_mem_free=42G would help in this situation. The amount of memory is requested per slot.
From the documentation: if a host can't fulfill the m_mem_free request, then the host is skipped.
2021.1 documentation
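For reference, a minimal sketch of the submit script with this request (an assumption: m_mem_free must be available as a complex in your Grid Engine flavor; it exists in Univa/Altair Grid Engine and can be checked with qconf -sc):

```shell
#! /bin/bash
# Request 42 GB of free memory per slot; hosts that cannot
# provide it are skipped by the scheduler.
#$ -l m_mem_free=42G
echo HOSTNAME: `hostname`
```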
Upvotes: 0