MaxRSS larger than ReqMem for Slurm job running Python. How come?

I am confused about how exactly allocated and used memory are defined for jobs on a cluster. I am limiting my jobs to use a maximum of 150 GB, yet they seem to be using ~670 GB. This has not happened to me before, and I am wondering if maybe I misunderstand the concept of memory here. The jobs run on a cluster at my institute. It has multiple nodes; I am using a GPU partition. My batch script is as follows:

#!/bin/bash
#SBATCH -o /PATH/TO/LOG/name_%A_%a.out.txt
#SBATCH -e /PATH/TO/LOG/name_%A_%a.err.txt
#SBATCH -c 10 --mem 150G --gres=gpu --partition=gpu

source ./venv/bin/activate
srun python main.py

My understanding is that when running this script with sbatch script.sh, the job runs on the gpu partition, can use 10 CPUs, and can use at most 150 GB of RAM. The Python main script prepares some data and starts an Optuna run for hyperparameter optimization (among other things); parallelism/multithreading might be used. Since I want to run multiple jobs on the node, I would like to allocate only the CPUs and RAM that are actually needed and leave the rest for other jobs.
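For context, main.py does roughly the following (a simplified, hypothetical sketch; the objective and hyperparameters below are placeholders, not the real ones):

# Hypothetical sketch of what main.py roughly does (the real objective trains
# a model on the GPU; the hyperparameters below are placeholders).
import optuna


def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 8)
    # ... load the prepared data, build and train the model, return the score
    return (lr - 1e-3) ** 2 + n_layers


if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    # n_jobs > 1 runs several trials in parallel inside the single srun task;
    # this would be one source of the parallelism/multithreading mentioned above
    study.optimize(objective, n_trials=50, n_jobs=4)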

Once the job is done, I check the resources it actually used with:

sacct -j JobID --format=JobId,JobName,Partition,AllocCPUS,MaxRSS,AveRSS,ReqMem

My output is as follows (modified to remove identifying details):

       JobID    JobName  Partition  AllocCPUS     MaxRSS     AveRSS     ReqMem
------------ ---------- ---------- ---------- ---------- ---------- ----------
1489712         script+        gpu         10                            150Gn
1489712.bat+      batch                    10     12368K     12368K      150Gn
1489712.ext+     extern                    10      1164K      1164K      150Gn
1489712.0        python                    10 673344288K 673344288K      150Gn

The Python step of the job seems to be using ~674 GB instead of the maximum possible 150 GB. I would expect that if the job really needed that much memory, it would be cancelled as soon as it exceeded 150 GB; that has been my experience so far. Is there a reasonable explanation for why the job does not get cancelled and instead runs to the end? Does it really need that much memory? The partition I use has a maximum of 700 GB of RAM, and while this job was running, 2-3 other jobs with ReqMem=150G were running on it at the same time.

seff, which has been suggested in other questions for checking job resource usage, is not available on the cluster I use.

It would be really great if someone could point me in the right direction on how to figure out how much memory my job really needs. It would also be great to know whether all the allocated CPUs are actually used.
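For reference, below is a rough, hypothetical sketch of the kind of summary I am after; the helper script itself is just an illustration, but MaxRSS, AllocCPUS, TotalCPU and Elapsed are standard sacct --format fields, and the efficiency figure is simply TotalCPU divided by AllocCPUS times Elapsed:

# Hypothetical helper: summarise per-step memory and a rough CPU-efficiency
# estimate from sacct for a finished job. Usage: python sacct_summary.py JOBID
import subprocess
import sys


def to_seconds(t):
    # Parse sacct durations such as "1-02:03:04", "02:03:04" or "05:17.123".
    days, _, rest = t.partition("-") if "-" in t else ("0", "", t)
    parts = [float(p) for p in rest.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds


job_id = sys.argv[1]
rows = subprocess.run(
    ["sacct", "-j", job_id, "--parsable2", "--noheader",
     "--format=JobID,MaxRSS,AllocCPUS,TotalCPU,Elapsed"],
    capture_output=True, text=True, check=True,
).stdout

for row in rows.strip().splitlines():
    step, maxrss, ncpus, totalcpu, elapsed = row.split("|")
    if not totalcpu or not elapsed:
        continue
    wall = to_seconds(elapsed) * int(ncpus or 0)
    eff = 100 * to_seconds(totalcpu) / wall if wall else 0.0
    print(f"{step:<15} MaxRSS={maxrss:>12} CPU efficiency ~{eff:.0f}%")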
