David Z

Reputation: 7041

How to parallelize this for loop using Slurm?

I have a large number of different bam files to process and here is my sbatch file:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=4000
#SBATCH --output=output.%j.out
#SBATCH --error=output.%j.err

srun picard.sh

My intent here is to run each job with threads=2.

And my picard.sh file:

#!/bin/bash

module load picard-tools/2.4.1-gcb01
module load java/1.8.0_45-fasrc01

picard=./picard-tools/2.4.1-gcb01/picard.jar
outdir=./bam2fastq/fastq
tmpdir=./tmp/

for bam in $(find . -type f -name \*.bam);
do
    echo "processing ${bam}"
    s=${bam##*/}
    r1=${s%.bam}_R1.fastq
    r2=${s%.bam}_R2.fastq
    echo $r1
    echo $r2
    java -Djava.io.tmpdir=${tmpdir} -Xmx8G -jar ${picard} SamToFastq \
        I=${bam} \
        FASTQ=${outdir}/${r1} \
        SECOND_END_FASTQ=${outdir}/${r2}
done

This processes each bam with two threads, but only one at a time. How can I parallelize it so that, say, six bam files are processed simultaneously, each with two threads?

Upvotes: 2

Views: 3371

Answers (2)

Jeff

Reputation: 29

Rather than use a Slurm array, I found it easier to handle the parallelization with srun, with something like this:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --ntasks=6            # allow six job steps to run at once
#SBATCH --cpus-per-task=2
#SBATCH --output=output.%j.out

# Load modules
module load picard-tools/2.4.1-gcb01
module load java/1.8.0_45-fasrc01

for bam_file in $(find . -type f -name \*.bam); do
    # --exclusive confines each step to its own share of the allocation,
    # so the backgrounded steps actually run concurrently
    srun --exclusive --ntasks=1 --cpus-per-task=2 \
        picard.sh "$bam_file" &
done

wait
echo "Finished $(date)"

srun then processes the bam files with 2 CPUs each, up to six at a time. Note that in picard.sh, you need to replace the for loop with bam=$1.
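A minimal sketch of the reworked picard.sh, reusing the module names and paths from the question:

#!/bin/bash
# picard.sh, reworked to convert the single bam passed in as $1

module load picard-tools/2.4.1-gcb01
module load java/1.8.0_45-fasrc01

picard=./picard-tools/2.4.1-gcb01/picard.jar
outdir=./bam2fastq/fastq
tmpdir=./tmp/

bam=$1                  # the file handed over by srun
s=${bam##*/}            # basename of the bam file
r1=${s%.bam}_R1.fastq
r2=${s%.bam}_R2.fastq

java -Djava.io.tmpdir=${tmpdir} -Xmx8G -jar ${picard} SamToFastq \
    I=${bam} \
    FASTQ=${outdir}/${r1} \
    SECOND_END_FASTQ=${outdir}/${r2}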

Upvotes: 0

jeremy

Reputation: 228

Could you try putting your for loop in a function, putting your input files in an array, and launching a job array? Something like:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=4000
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err
#SBATCH --array=0-5          # one task per bam file; adjust to the file count


#Loading modules and variables
module load picard-tools/2.4.1-gcb01
module load java/1.8.0_45-fasrc01

picard=./picard-tools/2.4.1-gcb01/picard.jar
outdir=./bam2fastq/fastq
tmpdir=./tmp/

#Array of my inputs
INPUT=( $(find . -type f -name \*.bam) )

#my function
func () {
    bam=$1
    echo "processing ${bam}"
    s=${bam##*/}
    r1=${s%.bam}_R1.fastq
    r2=${s%.bam}_R2.fastq
    echo $r1
    echo $r2
    java -Djava.io.tmpdir=${tmpdir} -Xmx8G -jar ${picard} SamToFastq \
        I=${bam} \
        FASTQ=${outdir}/${r1} \
        SECOND_END_FASTQ=${outdir}/${r2}
}

#launch job arrays
func "${INPUT[$SLURM_ARRAY_TASK_ID]}"

Note 1: in case the array is much larger, you can also limit the number of tasks running in parallel with:

#SBATCH --array=0-1000%100

In this example you will limit the number of simultaneously running tasks from this job array to 100.
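Since the array range has to match the number of input files, here is a sketch of one way to size it at submission time (assuming the sbatch script above is saved as job.sh, a hypothetical name, and is submitted from the directory holding the bam files; command-line options override the directives inside the script):

# Count the bam files and submit a matching 0-indexed array,
# throttled to 100 concurrent tasks
N=$(find . -type f -name '*.bam' | wc -l)
sbatch --array=0-$((N - 1))%100 job.sh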

Note 2: This question is closely related to this post.

Note 3: See the Slurm documentation for job arrays.

Upvotes: 2
