avgJoe

Reputation: 842

How to monitor running screen sessions and start a new session once the last one has ended?

To run my neural network training, I start different training configurations using the following scripts:

NNtrain.sh

#!/bin/bash
echo "Start 1st screen"
screen -dmS NN48001 bash -c '../NNrun.sh NN48001 hyperparam_48_001.json 0 0.5'
echo "Start 2nd screen"
screen -dmS NN480001 bash -c '../NNrun.sh NN480001 hyperparam_48_0001.json 0 0.5'
echo "Start 3rd screen"
screen -dmS NN4800001 bash -c '../NNrun.sh NN4800001 hyperparam_48_00001.json 1 0.5'
echo "Start 4th screen"
screen -dmS NN48000001 bash -c '../NNrun.sh NN48000001 hyperparam_48_000001.json 2 0.5'

NNrun.sh

#!/bin/bash
if [ -f "/opt/anaconda/etc/profile.d/conda.sh" ]; then
    . "/opt/anaconda/etc/profile.d/conda.sh"
    CONDA_CHANGEPS1=false conda activate PyTorchNN
    echo "Activated conda env"
fi
echo $1
python main_broad_FEA.py --hyperparam-json $2 --GPU $3 --varstop $4

Now, I have 3 GPUs in my machine and would like to batch-train more networks, i.e. start the next training after the last one has ended. Thus, I would like to monitor which screen sessions have closed (i.e. returned) and then start a new screen session on the GPU that was used by the session that just finished.

How can I check if and which of my screen sessions returned, so that I can start the next one using a bash script?

(Note: if it is unnecessarily complicated to do this in a bash script, then please feel free to propose a suitable alternative.)

Upvotes: 1

Views: 410

Answers (2)

Ole Tange

Reputation: 33685

#!/bin/bash

doit() {
    if [ -f "/opt/anaconda/etc/profile.d/conda.sh" ]; then
      . "/opt/anaconda/etc/profile.d/conda.sh"
      CONDA_CHANGEPS1=false conda activate PyTorchNN
      echo "Activated conda env"
    fi
    echo $1
    python main_broad_FEA.py --hyperparam-json $2 --GPU $3 --varstop $4
}
export -f doit

parallel -j 3 doit {1} {2} {%} {3} ::: a r g 1 ::: a r g 2 ::: a r g 4

Explanation:

  • "-j 3" defines the number of job slots
  • "{1}" is replaced by the respective element from "arg1" - likewise for {2} and {3}
  • {%} is the job slot number. We use this to determine which GPU to run on. (See page 30 of https://doi.org/10.5281/zenodo.1146014)
  • ":::" is followed by a list of arguments (one per job)

If you want to monitor the running jobs live, you can do that using tmux:

parallel --tmux -j 3 doit {1} {2} {%} {3} ::: a r g 1 ::: a r g 2 ::: a r g 4
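If GNU parallel is not available, the same job-slot bookkeeping can be sketched in plain bash (requires bash >= 4.3 for `wait -n`). Everything here is a stand-in: `run_training` simulates the real NNrun.sh call, and the short sleeps simulate training runs. One caveat worth noting: parallel's {%} counts slots from 1, while the GPUs in the question are numbered 0-2.

```shell
#!/bin/bash
# Plain-bash sketch of the job-slot idea: at most 3 concurrent jobs,
# each job occupying one GPU, and a freed GPU is reused for the next job.
run_training() {   # $1 = hyperparam json, $2 = GPU id (both hypothetical)
    echo "training $1 on GPU $2"
    sleep 0.2      # simulate a short training run
}

configs=(hyperparam_48_001.json hyperparam_48_0001.json
         hyperparam_48_00001.json hyperparam_48_000001.json)

free_gpus=(0 1 2)          # one "slot" per GPU
declare -A gpu_of_pid      # running PID -> GPU it occupies
launched=0

for cfg in "${configs[@]}"; do
    while [ "${#free_gpus[@]}" -eq 0 ]; do
        wait -n            # block until some background job exits...
        for pid in "${!gpu_of_pid[@]}"; do
            if ! kill -0 "$pid" 2>/dev/null; then   # ...then find which one(s)
                free_gpus+=("${gpu_of_pid[$pid]}")  # reclaim that job's GPU
                unset "gpu_of_pid[$pid]"
            fi
        done
    done
    gpu=${free_gpus[0]}
    free_gpus=("${free_gpus[@]:1}")
    run_training "$cfg" "$gpu" &
    gpu_of_pid[$!]=$gpu
    launched=$((launched + 1))
done
wait                       # let the remaining jobs finish
```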

Upvotes: 1

avgJoe

Reputation: 842

As I presented my question above, I was over-complicating the problem.

The solution I used in the end was to add `touch "$3.GPUFREE"` to the end of my NNrun.sh script. This creates an empty marker file named `<GPU-id>.GPUFREE` when NNrun.sh terminates. Then, from NNtrain.sh, I run a loop that checks whether a ".GPUFREE" file has appeared, which tells me which GPU was freed up. The script then deletes the file and starts the next job on that GPU.
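For reference, a minimal runnable sketch of that loop. The two-entry queue, the pre-created marker files (which simulate two trainings that just finished on GPU 0 and GPU 2), the poll interval, and the `launch_job` stand-in are all assumptions for the demo, not the original scripts:

```shell
#!/bin/bash
# Assumes NNrun.sh ends with: touch "$3.GPUFREE"
cd "$(mktemp -d)"                      # scratch dir so the demo is self-contained

# Simulate two trainings that just finished on GPU 0 and GPU 2:
touch 0.GPUFREE 2.GPUFREE

# Remaining configs, "json varstop" per entry (hypothetical file names):
queue=("hyperparam_48_0000001.json 0.5" "hyperparam_48_00000001.json 0.5")

launch_job() {   # $1 = GPU id, $2 = hyperparam json, $3 = varstop
    # Real version: screen -dmS "NN_$1" bash -c "../NNrun.sh NN_$1 $2 $1 $3"
    echo "start $2 on GPU $1"
}

i=0
while [ "$i" -lt "${#queue[@]}" ]; do
    for f in *.GPUFREE; do
        [ -e "$f" ] || continue        # glob matched nothing, keep polling
        gpu=${f%.GPUFREE}              # file name encodes the freed GPU id
        rm -- "$f"                     # consume the marker
        launch_job "$gpu" ${queue[$i]} # word-splits into json + varstop
        i=$((i + 1))
        [ "$i" -ge "${#queue[@]}" ] && break
    done
    sleep 1                            # poll interval
done
```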

Upvotes: 0
