CuriousDude

Reputation: 1117

Executing bash script with parallel for many directories

I have a bash script (chunks.sh) that executes several mini scripts in parallel, and I am wondering how to properly run chunks.sh so that it processes many folders in parallel. I have about 1000 folders with files that need to be processed. Here is my script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=16:00:00
#SBATCH --output=mpi_output_%j.txt
#SBATCH --mail-type=FAIL

cd $SLURM_SUBMIT_DIR

module load gcc
module load gnu-parallel
module load bwa
module load samtools

parallel -j 10 < ../1convertfiles.sh
parallel -j 10 < ../2sortfiles.sh
parallel -j 10 < ../3indexfiles.sh
parallel -j 10 < ../4converttopile.sh
parallel -j 10 < ../5createconsensus.sh
parallel -j 10 < ../6concatenateconsensus.sh

Each folder has a name such as THAKID0001_dir, THAKID0010_dir, etc. How can I make this script loop through the current directory, find all the directories ending in _dir, and then execute all of the mini scripts inside each one?

I tried putting my parallel commands into for loops, but that reran the mini scripts many times over. I think I can use:

parallel -j 10 < 1convertfiles.sh ::: *_dir/*  
parallel -j 10 < 2sortfiles.sh ::: *_dir/*
etc.

But with this logic it seems that the parallel command blocks will not all be running on the SAME directory at once. Each parallel line will find its own directory to work in, and these mini scripts have to run in order, which is why I tried writing a for loop, but that created a huge mess.
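For reference, the loop I tried looked roughly like this (a reconstruction, not my exact code). Since each numbered file already lists the commands for every folder, every pass through the loop launched all of them again:

# Reconstructed attempt: the command lists are not scoped to $d,
# so each iteration re-runs the commands for ALL folders.
for d in *_dir; do
    parallel -j 10 < ../1convertfiles.sh
    parallel -j 10 < ../2sortfiles.sh
    # ...same for the remaining four steps
done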

Expected Results:

 $ ./chunks.sh
 ### Should run the list of commands per folder ###
 ### For example, it will execute all the parallel commands in THAKID0001_dir, then all the parallel commands in THAKID0002_dir, etc. ###

TL;DR: How do I make chunks.sh execute these parallel command blocks for all directories matching a certain pattern (i.e. THAK*_dir), where each line runs only once the previous line has completed? Hope this made sense. Thank you!

Upvotes: 2

Views: 1338

Answers (1)

dash-o

Reputation: 14452

On the surface, the problem requires a helper script that performs the sequential processing:

process-dir.sh, placed in $SLURM_SUBMIT_DIR:

#! /bin/bash
# Process all jobs for one folder, sequentially.
# Input: folder name, e.g. THAKID0001_dir
cd "$1" || exit 1
../1convertfiles.sh
../2sortfiles.sh
../3indexfiles.sh
../4converttopile.sh
../5createconsensus.sh
../6concatenateconsensus.sh
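Note: chunks.sh in the question feeds each numbered file to parallel on stdin (parallel -j 10 < ../1convertfiles.sh), which suggests those files are lists of commands rather than standalone scripts. If that is the case, a variant of the helper would keep the inner parallel calls (the -j 4 job counts here are illustrative):

#! /bin/bash
# Variant helper, assuming each numbered file is a list of commands
# (one per line) meant to be consumed by parallel, as the stdin
# redirection in the question implies.
cd "$1" || exit 1
parallel -j 4 < ../1convertfiles.sh
parallel -j 4 < ../2sortfiles.sh
parallel -j 4 < ../3indexfiles.sh
parallel -j 4 < ../4converttopile.sh
parallel -j 4 < ../5createconsensus.sh
parallel -j 4 < ../6concatenateconsensus.sh

Each parallel call still finishes completely before the next one starts, so the six steps stay in order within a folder.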

And then run it in parallel:

#! /bin/bash
cd "$SLURM_SUBMIT_DIR"

module load gcc
module load gnu-parallel
module load bwa
module load samtools

# Run one process-dir.sh per *_dir folder, 10 folders at a time.
parallel -j10 ./process-dir.sh ::: *_dir
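Make the helper executable first (chmod +x process-dir.sh). With a batch this long, a job log also makes it easy to see which folders finished and to resume after hitting the time limit; --joblog and --resume are standard GNU parallel options (the log file name here is just an example):

# Optional: record one line per completed folder and resume a partial run.
parallel -j10 --joblog chunks.joblog --resume ./process-dir.sh ::: *_dir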

Or avoid the separate process-dir.sh file by defining a bash function directly:

#! /bin/bash
cd "$SLURM_SUBMIT_DIR"

module load gcc
module load gnu-parallel
module load bwa
module load samtools

process-dir() {
  # Process all jobs for one folder, sequentially.
  # Input: folder name, e.g. THAKID0001_dir
  cd "$1" || return 1
  ../1convertfiles.sh
  ../2sortfiles.sh
  ../3indexfiles.sh
  ../4converttopile.sh
  ../5createconsensus.sh
  ../6concatenateconsensus.sh
}
# Export the function so the bash processes started by parallel can call it.
export -f process-dir

parallel -j10 process-dir ::: *_dir
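To sanity-check the wiring before committing to a 16-hour run, GNU parallel's --dryrun prints each command it would execute without running anything:

# Prints one process-dir invocation per *_dir folder; nothing is executed.
parallel -j10 --dryrun process-dir ::: *_dir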

Upvotes: 1
