David Z

Reputation: 7041

How to parallelize this for loop using Slurm?

I have a large number of different bam files to process and here is my sbatch file:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=4000
#SBATCH --output=output.%j.out
#SBATCH --error=output.%j.err

srun picard.sh

My intent here is to run each job with threads=2.

And my picard.sh file:

#!/bin/bash

module load picard-tools/2.4.1-gcb01
module load java/1.8.0_45-fasrc01

picard=./picard-tools/2.4.1-gcb01/picard.jar
outdir=./bam2fastq/fastq
tmpdir=./tmp/

for bam in $(find . -type f -name \*.bam);
do
    echo "processing ${bam}"
    s=${bam##*/}
    r1=${s%.bam}_R1.fastq
    r2=${s%.bam}_R2.fastq
    echo $r1
    echo $r2
    java -Djava.io.tmpdir=${tmpdir} -Xmx8G -jar ${picard} SamToFastq \
        I=${bam} \
        FASTQ=${outdir}/${r1} \
        SECOND_END_FASTQ=${outdir}/${r2}
done

This processes each bam with two threads, but only one at a time. How can I parallelize it so that, say, six bam files are processed simultaneously, each with two threads?

Upvotes: 2

Views: 3371

Answers (2)

Jeff

Reputation: 29

Rather than use a Slurm array, I found it easier to handle the parallelization with srun, with something like this:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --ntasks=6            # allow six job steps to run at once
#SBATCH --cpus-per-task=2
#SBATCH --output=output.%j.out

# Load modules
module load picard-tools/2.4.1-gcb01
module load java/1.8.0_45-fasrc01

for bam_file in $(find . -type f -name \*.bam); do
    # --exclusive confines each step to its own share of the allocation,
    # so the backgrounded steps actually run concurrently
    srun --exclusive --ntasks=1 --cpus-per-task=2 \
        picard.sh "$bam_file" &
done

wait
echo "Finished $(date)"

srun then processes the bam files with 2 CPUs each, up to six at a time. Note that in picard.sh, you need to replace the for loop with bam=$1.
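A minimal sketch of the reworked picard.sh, reusing the module names and paths from the question:

#!/bin/bash
# picard.sh, reworked to convert the single bam passed in as $1

module load picard-tools/2.4.1-gcb01
module load java/1.8.0_45-fasrc01

picard=./picard-tools/2.4.1-gcb01/picard.jar
outdir=./bam2fastq/fastq
tmpdir=./tmp/

bam=$1                  # the file handed over by srun
s=${bam##*/}            # basename of the bam file
r1=${s%.bam}_R1.fastq
r2=${s%.bam}_R2.fastq

java -Djava.io.tmpdir=${tmpdir} -Xmx8G -jar ${picard} SamToFastq \
    I=${bam} \
    FASTQ=${outdir}/${r1} \
    SECOND_END_FASTQ=${outdir}/${r2}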

Upvotes: 0

jeremy

Reputation: 228

Could you try putting your for loop in a function, putting your input files in an array, and launching a job array? Something like:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=4000
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err
#SBATCH --array=0-5          # one task per bam file; adjust to the file count


#Loading modules and variables
module load picard-tools/2.4.1-gcb01
module load java/1.8.0_45-fasrc01

picard=./picard-tools/2.4.1-gcb01/picard.jar
outdir=./bam2fastq/fastq
tmpdir=./tmp/

#Array of my inputs
INPUT=( $(find . -type f -name \*.bam) )

#my function
func () {
    bam=$1
    echo "processing ${bam}"
    s=${bam##*/}
    r1=${s%.bam}_R1.fastq
    r2=${s%.bam}_R2.fastq
    echo $r1
    echo $r2
    java -Djava.io.tmpdir=${tmpdir} -Xmx8G -jar ${picard} SamToFastq \
        I=${bam} \
        FASTQ=${outdir}/${r1} \
        SECOND_END_FASTQ=${outdir}/${r2}
}

#launch job arrays
func "${INPUT[$SLURM_ARRAY_TASK_ID]}"

Note 1: in case the array is much larger, you can also limit the number of tasks running in parallel with:

#SBATCH --array=0-1000%100

In this example you will limit the number of simultaneously running tasks from this job array to 100.
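Since the array range has to match the number of input files, here is a sketch of one way to size it at submission time (assuming the sbatch script above is saved as job.sh, a hypothetical name, and is submitted from the directory holding the bam files; command-line options override the directives inside the script):

# Count the bam files and submit a matching 0-indexed array,
# throttled to 100 concurrent tasks
N=$(find . -type f -name '*.bam' | wc -l)
sbatch --array=0-$((N - 1))%100 job.sh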

Note 2: This question is closely related to this post.

Note 3: See the Slurm documentation for job arrays.

Upvotes: 2
