jaegger
jaegger

Reputation: 88

SLURM sbatch script not running all srun commands in while loop

I'm trying to submit multiple jobs in parallel as a preprocessing step in sbatch using srun. The loop reads a file containing 40 file names and uses "srun command" on each file. However, not all files are being sent off with srun and the rest of the sbatch script continues after the ones that did get submitted finish. The real sbatch script is more complicated and I can't use arrays with this so that won't work. This part should be pretty straightforward though.

I made this simple test case as a sanity check and it does the same thing. For every file name in the file list (40) it creates a new file containing 'foo' in it. Every time I submit the script with sbatch it results in a different number of files being sent off with srun.

#!/bin/sh
#SBATCH --job-name=loop
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G
#SBATCH -A zheng_lab
#SBATCH -p exacloud
#SBATCH --error=/home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel/log_files/test.%J.err
#SBATCH --output=/home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel/log_files/test.%J.out

DIR=/home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel
SAMPLES=$DIR/samples.txt
OUT_DIR=$DIR/test_out
FOO_FILE=$DIR/foo.txt

# Create output directory
srun -N 1 -n 1 -c 1 mkdir $OUT_DIR

# How many files to run
num_files=$(srun -N 1 -n 1 -c 1 wc -l $SAMPLES)
echo "Number of input files: " $num_files

# Create a new file for every file in listing (run 5 at a time, 1 for each node)
while read F  ;
do
    fn="$(rev <<< "$F" | cut -d'/' -f 1 | rev)" # Remove path for writing output to new directory
    echo $fn
    srun -N 1 -n 1 -c 1 cat $FOO_FILE > $OUT_DIR/$fn.out &
done <$SAMPLES
wait

# How many files actually got created
finished=$(srun -N 1 -n 1 -c 1 ls -lh $OUT_DIR/*out | wc -l)
echo "Number of files submitted: " $finished

Here is my output log file the last time I tried to run it:

Number of input files:  40 /home/exacloud/lustre1/zheng_lab/users/eggerj/Dissertation/splice_net_prototype/beatAML_data/splicing_quantification/test_build_parallel/samples.txt
sample1
sample2
sample3
sample4
sample5
sample6
sample7
sample8
Number of files submitted:  8

Upvotes: 0

Views: 3578

Answers (1)

damienfrancois
damienfrancois

Reputation: 59300

The issue is that srun redirects its stdin to the tasks it starts, and therefore the contents of $SAMPLES is consumed, in an unpredictable way, by all the cat commands that are started.

Try with

srun --input none -N 1 -n 1 -c 1 cat $FOO_FILE > $OUT_DIR/$fn.out &

The --input none parameter will tell srun to not mess with stdin.

Upvotes: 1

Related Questions