Reputation: 7041
I have a large number of BAM files to process, and here is my sbatch file:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=4000
#SBATCH --output=output.%j.out
#SBATCH --error=output.%j.err
srun picard.sh
My intention here was to run each job with two threads. And here is my picard.sh file:
#!/bin/bash
module load picard-tools/2.4.1-gcb01
module load java/1.8.0_45-fasrc01
picard=./picard-tools/2.4.1-gcb01/picard.jar
outdir=./bam2fastq/fastq
tmpdir=./tmp/
for bam in $(find . -type f -name \*.bam); do
    echo "processing ${bam}"
    s=${bam##*/}
    r1=${s%.bam}_R1.fastq
    r2=${s%.bam}_R2.fastq
    echo "${r1}"
    echo "${r2}"
    java -Djava.io.tmpdir=${tmpdir} -Xmx8G -jar ${picard} SamToFastq \
        I=${bam} \
        FASTQ=${outdir}/${r1} \
        SECOND_END_FASTQ=${outdir}/${r2}
done
This processes each BAM file with two threads, but only one file at a time. How could I parallelize this so that, say, six BAM files are processed simultaneously, each with two threads?
Upvotes: 2
Views: 3371
Reputation: 29
Rather than use a Slurm job array, I found it easier to handle the parallelization with srun, which would look something like this:
#!/bin/bash
#
#SBATCH --job-name=test
# Allocate enough tasks that six job steps can run at once
#SBATCH --ntasks=6
#SBATCH --cpus-per-task=2
#SBATCH --output=output.%j.out

# Loading modules and variables
module load picard-tools/2.4.1-gcb01
module load java/1.8.0_45-fasrc01

for bam_file in $(find . -type f -name \*.bam); do
    # --exclusive gives each step its own slice of the allocation,
    # so the backgrounded steps actually run concurrently
    srun --exclusive --ntasks=1 --cpus-per-task=2 \
        picard.sh "${bam_file}" &
done
wait
echo "Finished $(date)"
srun would then process the BAM files with two CPUs each. Note that in picard.sh, you need to replace the for loop with bam=$1.
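For concreteness, here is a minimal sketch of what the modified picard.sh could look like, reusing the paths and module versions from the question (the mkdir -p line is my addition, to make sure the target directories exist):
#!/bin/bash
# Modified picard.sh: converts the single BAM file passed as $1
module load picard-tools/2.4.1-gcb01
module load java/1.8.0_45-fasrc01
picard=./picard-tools/2.4.1-gcb01/picard.jar
outdir=./bam2fastq/fastq
tmpdir=./tmp/
mkdir -p ${outdir} ${tmpdir}   # added: create directories if missing
bam=$1                          # the BAM file now arrives as an argument instead of a loop variable
s=${bam##*/}
r1=${s%.bam}_R1.fastq
r2=${s%.bam}_R2.fastq
echo "processing ${bam}"
java -Djava.io.tmpdir=${tmpdir} -Xmx8G -jar ${picard} SamToFastq \
    I=${bam} \
    FASTQ=${outdir}/${r1} \
    SECOND_END_FASTQ=${outdir}/${r2}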
Upvotes: 0
Reputation: 228
You could put the body of your for loop in a function, put your input files in an array, and launch a job array. Something like:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=4000
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err
# One array task per input BAM file (here: six files, indices 0-5)
#SBATCH --array=0-5

# Loading modules and variables
module load picard-tools/2.4.1-gcb01
module load java/1.8.0_45-fasrc01
picard=./picard-tools/2.4.1-gcb01/picard.jar
outdir=./bam2fastq/fastq
tmpdir=./tmp/
# Array of input files
INPUT=( $(find . -type f -name \*.bam) )

# Conversion function: processes the single BAM file passed as $1
func () {
    bam=$1
    echo "processing ${bam}"
    s=${bam##*/}
    r1=${s%.bam}_R1.fastq
    r2=${s%.bam}_R2.fastq
    echo "${r1}"
    echo "${r2}"
    java -Djava.io.tmpdir=${tmpdir} -Xmx8G -jar ${picard} SamToFastq \
        I=${bam} \
        FASTQ=${outdir}/${r1} \
        SECOND_END_FASTQ=${outdir}/${r2}
}

# Launch: each array task processes the input selected by its index
func "${INPUT[$SLURM_ARRAY_TASK_ID]}"
Note 1: if you end up with many more input files, you can also limit how many array tasks run in parallel:
#SBATCH --array=0-1000%100
In this example, at most 100 tasks from the job array will run simultaneously.
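Note that --array=0-5 assumes exactly six BAM files. As a small sketch (my assumption, not part of the original answer), you could count the files and size the array at submission time, since command-line options override #SBATCH directives:
# Hypothetical submission helper; assumes the job array script
# above is saved as bam2fastq.sbatch
n=$(find . -type f -name '*.bam' | wc -l)
sbatch --array=0-$(( n - 1 )) bam2fastq.sbatch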
Note 2: this question is closely related to this post.
Note 3: see the Slurm documentation for job arrays: https://slurm.schedmd.com/job_array.html
Upvotes: 2