Reputation: 21
I am trying to run a job (Python code) on a cluster using MPI. There is 63 GB of memory available on each node. When I run it on one node, I specify the PBS parameters as follows (only the relevant parameters are listed here):
#PBS -l mem=60GB
#PBS -l nodes=node01.cluster:ppn=32
time mpiexec -n 32 python code.py
That works just fine.
Since the PBS man page says mem is the memory for the entire job, my parameters when trying to run it on two nodes are:
#PBS -l mem=120GB
#PBS -l nodes=node01.cluster:ppn=32+node02.cluster:ppn=32
time mpiexec -n 64 python code.py
This doesn't work (qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max mem requirement). It fails even if I set mem=70GB, for example (in case the system needs some extra memory).
If I set mem=60GB when trying to use both nodes, I get
=>> PBS: job killed: mem job total xx kb exceeded limit yy kb.
I tried it with pmem as well (that's 60 GB divided by 32 processes, i.e. pmem=1875MB), but no success.
My question is: how can I use the entire 120 GB of memory?
Upvotes: 2
Views: 1881
Reputation: 74395
Torque / PBS ignores the mem resource unless the job uses a single node (see here):
Maximum amount of physical memory used by the job. (Ignored on Darwin, Digital Unix, Free BSD, HPUX 11, IRIX, NetBSD, and SunOS. Also ignored on Linux if number of nodes is not 1. Not implemented on AIX and HPUX 10.)
You should instead use the pmem resource, which limits the memory per job process. With ppn=32 you should set pmem to 1920MB in order to get 60 GB per node (32 × 1920 MB = 61440 MB = 60 GB). Keep in mind that pmem does not allow flexible distribution of memory between the processes running on a node the way mem does, since the latter is accounted as an aggregate value while pmem applies to each process individually.
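For example, the two-node job from the question could be submitted like this (a minimal sketch reusing the node names and script from the question; 1920 MB per process times 32 processes per node gives 60 GB per node):
#PBS -l nodes=node01.cluster:ppn=32+node02.cluster:ppn=32
#PBS -l pmem=1920MB
time mpiexec -n 64 python code.py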
Upvotes: 2