I am confused about how exactly allocated and used memory are defined for jobs on a cluster. I limit my jobs to a maximum of 150 GB, yet they seem to be using ~650 GB. This has not happened to me before, and I am wondering whether I misunderstand the concepts of allocated vs. used memory. The jobs run on a cluster at my institute; it has multiple nodes, and I am using a GPU partition. My bash script is as follows:
#!/bin/bash
#SBATCH -o /PATH/TO/LOG/name_%A_%a.out.txt
#SBATCH -e /PATH/TO/LOG/name_%A_%a.err.txt
#SBATCH -c 10 --mem 150G --gres=gpu --partition=gpu
source ./venv/bin/activate
srun python main.py
My understanding is that when I submit this script with sbatch script.sh, the job runs on the gpu partition, can use up to 10 CPUs, and can use at most 150 GB of RAM.
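As a sanity check (assuming scontrol is available on the login node), I can inspect the allocation Slurm actually granted while the job is pending or running, e.g.:

# Show the granted resources for job 1489712 (CPUs, memory, GRES, partition)
scontrol show job 1489712 | grep -E 'NumCPUs|TRES|Partition'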
The Python main script prepares some data and starts an Optuna run for hyperparameter optimization (among other things). Parallelism/multithreading might be used.
Since I want to run multiple jobs on the node, I would like to allocate only the CPUs and RAM that are actually needed and leave the rest for other jobs.
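As an aside, newer Slurm versions can show a node's configured resources next to what is currently allocated, which helps when packing several jobs onto one node (a sketch; gpu-node01 is a placeholder for the actual node name):

# Configured vs. currently allocated CPUs/memory/GPUs on a node
scontrol show node gpu-node01 | grep -E 'CfgTRES|AllocTRES'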
Once the job is done, I check the resources it used with:
sacct -j JobID --format=JobId,JobName,Partition,AllocCPUS,MaxRSS,AveRSS,ReqMem
My output is as follows (modified to remove some details):
JobID         JobName  Partition  AllocCPUS  MaxRSS      AveRSS      ReqMem
------------  -------  ---------  ---------  ----------  ----------  ------
1489712       script+  gpu        10                                 150Gn
1489712.bat+  batch               10         12368K      12368K      150Gn
1489712.ext+  extern              10         1164K       1164K       150Gn
1489712.0     python              10         673344288K  673344288K  150Gn
The python step of the job seems to be using ~674 GB instead of the maximum of 150 GB. I would expect that if the job really needed that much memory, it would be cancelled as soon as it used more than 150 GB; that has been my experience so far. Is there any reasonable explanation why the job does not get cancelled and runs to completion? Does it really need that much memory? The partition I use has a maximum of 700 GB of RAM, and while this job was running, there were 2-3 other jobs with ReqMem=150G running on it at the same time.
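For completeness, sacct can report a few more fields that may help narrow this down. This is only a sketch using standard sacct field names; --units=G should convert the counts to GiB (if the K suffix above is KiB, which I believe is Slurm's default, 673344288K corresponds to roughly 642 GiB):

# Memory, virtual memory, and CPU time per job step, printed in GiB
sacct -j 1489712 --units=G --format=JobID,JobName,AllocCPUS,ReqMem,MaxRSS,MaxVMSize,AveCPU,TotalCPU,Elapsed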
seff, which has been suggested in other questions for checking job resource usage, is not available on the cluster I use.
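In case it is useful, a rough stand-in for seff might look like the sketch below. It only pulls the accounting fields seff derives its numbers from (JOBID is a placeholder) and leaves the efficiency arithmetic to be done by hand:

#!/bin/bash
# Rough seff stand-in: print the fields seff bases its report on.
JOBID=1489712   # placeholder: set to the job of interest
sacct -j "$JOBID" --parsable2 --noheader \
      --format=JobID,AllocCPUS,Elapsed,TotalCPU,MaxRSS,ReqMem |
while IFS='|' read -r id cpus elapsed totalcpu maxrss reqmem; do
    echo "$id: AllocCPUS=$cpus Elapsed=$elapsed TotalCPU=$totalcpu MaxRSS=$maxrss ReqMem=$reqmem"
done
# CPU efficiency    ~ TotalCPU / (AllocCPUS * Elapsed)
# Memory efficiency ~ MaxRSS / ReqMem (for the step doing the actual work)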
It would be really great if someone could point me in the right direction on how to figure out how much memory my job really needs. It would also be great to know whether all the allocated CPUs are actually used.
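While a job is still running, something like this could also help (a sketch, assuming sstat is enabled on the cluster; 1489712.0 refers to the python step from the output above):

# Current memory and CPU usage of the running python step
sstat -j 1489712.0 --format=JobID,MaxRSS,AveRSS,AveCPU,NTasks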