Reputation: 65
I'm new to using SLURM to train a batch of convolutional neural networks (CNNs). To keep track of all trained CNNs easily, I'd like to pass the SLURM job ID as an input argument to Python. Passing other variables as arguments works fine. However, I cannot get access to the SLURM job ID to pass it.
I already tried using ${SLURM_JOBID}, ${SLURM_JOB_ID}, %j and %J. I also tried writing these Slurm environment variables into a variable before passing it to Python.
Here is my latest code:
#!/bin/bash
# --- info to user
echo "script started ... "
# --- setup environment
module purge # clean up
module load python/3.6
module load nvidia/10.0
module load cudnn/10.0-v7
# --- display information
HOST=`hostname`
echo "This script runs the CNN. Slurm scheduled it on node $HOST"
echo "I am interested of all environment variables Slurm adds:"
env | grep -i slurm
# --- start running ...
echo " --- run --- "
# --- define some variables
dc="dice"
sm="softmax"
# --- run a job using a slurm batch script
for layer in {3..15..2}
do
sbatch -N 1 -n 1 --mem=20G --mail-type=END --gres=gpu:V100:3 --wrap="singularity --noslurm tensorflow_19.03-py3.simg python run_CNN_dynlayer.py ${SLURM_JOBID} ${layer} ${dc}"
sleep 1 # pause 1s to be kind to the scheduler...
echo "jobid: "+${SLURM_JOBID}
echo " --- next --- "
done
The terminal output looks like this:
femonk@rarp1 [CNN] ./run_CNN_test.slurm
script started ...
This script runs the CNN. Slurm scheduled it on node rarp1
I am interested of all environment variables Slurm adds:
SLURM_ACCOUNT=AI
PYTHONPATH=/cluster/slurm/lib64/python3.6/site-packages:/cluster/slurm/lib64/python3.6/site-packages:/cluster/slurm/lib64/python3.6/site-packages:
--- run ---
Submitted batch job 3182711
jobid:
--- next ---
femonk@rarp1 [CNN]
Does anyone have an idea what's wrong with my code? Thanks a lot in advance.
Upvotes: 3
Views: 3787
Reputation: 59250
The SLURM_JOBID environment variable is made available to the job processes only, not to the process that submits the jobs. The job ID is returned by the sbatch command, so if you want it in a variable, you need to assign it.
for layer in {3..15..2}
do
# \${SLURM_JOBID} is escaped so that it is expanded inside the job, not by this submitting shell
SLURM_JOBID=$(sbatch --parsable -N 1 -n 1 --mem=20G --mail-type=END --gres=gpu:V100:3 --wrap="singularity --noslurm tensorflow_19.03-py3.simg python run_CNN_dynlayer.py \${SLURM_JOBID} ${layer} ${dc}")
sleep 1 # pause 1s to be kind to the scheduler...
echo "jobid: ${SLURM_JOBID}"
echo " --- next --- "
done
Note the use of the command substitution $() together with the --parsable argument of sbatch.
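As a minimal sketch of how the captured value could be handled: with --parsable, sbatch prints only the job ID, followed by a semicolon and the cluster name when one is present. The helper names parse_job_id and submit_and_get_id and the cluster name mycluster are made up for illustration and are not part of Slurm.

```shell
#!/bin/bash
# `sbatch --parsable` prints "jobid" or "jobid;cluster"; keep only the ID part.
parse_job_id() {
    printf '%s\n' "${1%%;*}"
}

# Hypothetical helper: submit a wrapped command and print the new job's ID.
# It is only defined here, so the sketch runs on machines without Slurm.
submit_and_get_id() {
    local raw
    raw=$(sbatch --parsable --wrap="$1") || return 1
    parse_job_id "$raw"
}

# Parsing alone can be tried without a scheduler ("mycluster" is made up):
parse_job_id "3182711"            # prints 3182711
parse_job_id "3182711;mycluster"  # prints 3182711
```

On a cluster, something like jobid=$(submit_and_get_id "hostname") would then store the ID of the freshly submitted job.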
Note also that the line Submitted batch job 3182711 of the current output will disappear, as it is used to populate the SLURM_JOBID variable.
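One more subtlety: inside a double-quoted --wrap string, the submitting shell expands ${SLURM_JOBID} before sbatch even runs, so the job receives whatever the variable held at submission time. Escaping the dollar sign defers the expansion to the job's own shell. The sketch below illustrates the difference without a cluster; it assumes a child bash process as a stand-in for the job, and the job ID 42 is made up.

```shell
#!/bin/bash
# Stand-in for a job: a child shell in which SLURM_JOBID is set,
# as Slurm does for real job processes. The value 42 is made up.
run_as_job() {
    SLURM_JOBID=42 bash -c "$1"
}

SLURM_JOBID=""  # empty here, like in the shell that calls sbatch

# Unescaped: expanded right here, while the variable is still empty.
early=$(run_as_job "echo id=${SLURM_JOBID}")

# Escaped: the literal text ${SLURM_JOBID} reaches the child shell,
# which expands it with the value set for the job.
late=$(run_as_job "echo id=\${SLURM_JOBID}")

echo "unescaped: ${early}"   # unescaped: id=
echo "escaped:   ${late}"    # escaped:   id=42
```

The same reasoning applies to the string given to --wrap, because sbatch places it in a small sh script that runs as the job.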
Upvotes: 6