BCArg

Reputation: 2250

maximum memory for batch jobs with qsub

In my working environment we have a cluster with some 33 physical nodes. Each node hosts 2 virtual machines (VMs), each with 10 CPUs (4 slots) and 112GB of memory.

I am submitting jobs to this cluster, and below are the maximum memory values that the jobs required (obtained with qacct -j [job]):

maxvmem      37.893GB
maxvmem      37.660GB
maxvmem      37.980GB
maxvmem      41.059GB
maxvmem      41.615GB
maxvmem      38.744GB
maxvmem      38.615GB
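For reference, this is roughly how the values above can be collected for a set of finished jobs (the job IDs below are placeholders):

#!/bin/bash
# Print the peak virtual memory (maxvmem) recorded in the accounting data
# for each finished job. The job IDs are hypothetical; replace with real ones.
for job in 1234501 1234502 1234503; do
    qacct -j "$job" | grep maxvmem
done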

Let's consider the maximum required memory to be 42GB for the rest of this question.

In fact, when submitting 92 jobs to this cluster (without specifying any qsub parameter), I noticed that some of them crashed, apparently due to memory issues. All the jobs that crashed were running on physical nodes with four jobs. That makes sense: if four jobs, each using up to 42GB, run on one physical node, 4*42 = 168GB (> 112GB), so I am not surprised that some jobs crashed.

I then decided to limit the memory per job. According to this link, this can be done via the -l h_vmem=[maxmem] qsub parameter, which I added to the shell script submitted to the queue (below are the first three lines of the .sh script, where the second line is the one that should limit the memory). Note that -l h_vmem is the memory per slot.

#! /bin/bash
#$ -l h_vmem=28G
echo HOSTNAME: `hostname`
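For completeness, the same limit can also be requested directly on the qsub command line instead of via the #$ directive (the script name here is just an example):

# Equivalent to the "#$ -l h_vmem=28G" directive inside the script.
qsub -l h_vmem=28G myjob.sh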

After submitting the 92 jobs, if I do qstat -j [job] I see a line such as:

hard resource_list:         h_vmem=28G

That means 4*28 = 112GB per physical node, which is my limit. This looks OK.

However, I see that some physical nodes already have 4 jobs running on them, which is what I wanted to avoid. Given that each job can take up to 42GB of memory, I would expect a maximum of 2 jobs per physical node (maximum memory required would be 2*42 = 84GB), so that they would not crash due to lack of memory.
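To check how the jobs are actually being packed onto hosts, something like the following can be used (both are standard Grid Engine commands; no site-specific options are assumed):

# List the jobs currently running on each execution host.
qhost -j

# Alternatively, show each queue instance with its used/total slots.
qstat -f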

So it seems that qsub is not interpreting the parameter #$ -l h_vmem=28G in my .sh script correctly, as the required memory can go up to 4*42 = 168GB, whereas 4*28 = 112GB should be my limit.

Am I using the wrong qsub parameter (-l h_vmem), the wrong notation in my .sh script (#$ -l h_vmem=28G - probably not, as it appears to have been parsed correctly, judging by the qstat output), or something else?

Upvotes: 0

Views: 1604

Answers (1)

Dom

Reputation: 90

The option -l m_mem_free=42G would help in this situation. The amount of memory is per slot. From the 2021.1 documentation: if a host can't fulfill the m_mem_free request, then the host is skipped.
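For illustration, a minimal sketch of how that request could be added to the submission script from the question, assuming m_mem_free is configured as a consumable resource on the cluster:

#! /bin/bash
# Request 42G of free memory per slot; hosts that cannot fulfill this are skipped.
#$ -l m_mem_free=42G
echo HOSTNAME: `hostname`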

Upvotes: 0
