Rgeek

Reputation: 449

Looping over files in bash

I want to loop over these kinds of files, where the files with the same Sample_ID have to be used together:

Sample_51770BL1_R1.fastq.gz
Sample_51770BL1_R2.fastq.gz

Sample_52412_R1.fastq.gz
Sample_52412_R2.fastq.gz

e.g. Sample_51770BL1_R1.fastq.gz and Sample_51770BL1_R2.fastq.gz are used together in one command to create an output.

Similarly, Sample_52412_R1.fastq.gz and Sample_52412_R2.fastq.gz are used together to create output.

I want to write a for loop in bash to iterate over the pairs and create the output.

sourcedir=/sourcepath/
destdir=/destinationpath/


bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta Sample_52412_R1.fastq.gz Sample_52412_R2.fastq.gz > $destdir/Sample_52412_R1_R2.sam

How should I pattern match the file names Sample_ID_R1 and Sample_ID_R2 to be used in one command?

Thanks,

Upvotes: 0

Views: 2498

Answers (1)

John1024

Reputation: 113814

for fname in *_R1.fastq.gz
do
    base=${fname%_R1*}    # strip "_R1" and everything after it
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam"
done
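
For illustration, ${fname%_R1*} removes the shortest suffix matching _R1* from the file name, leaving just the sample prefix:

fname=Sample_51770BL1_R1.fastq.gz
base=${fname%_R1*}    # "%" removes the shortest matching suffix
echo "$base"          # prints: Sample_51770BL1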

In the comments, you ask about running several, but not too many, jobs in parallel. Below is my first stab at that:

#!/bin/bash
# Limit background jobs to no more than $maxproc at once.
maxproc=3

for fname in *_R1.fastq.gz
do
    while [ $(jobs | wc -l) -ge "$maxproc" ]
    do
        sleep 1
    done
    base=${fname%_R1*}
    echo starting new job with ongoing=$(jobs | wc -l)
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done

The optimal value of maxproc will depend on how many processors your PC has. You may need to experiment to find what works best.
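
Assuming GNU coreutils is available, one reasonable starting point is to set maxproc from the machine's processor count and adjust from there:

maxproc=$(nproc)    # nproc prints the number of available processing units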

Note that the above script uses jobs, which is a bash builtin. Thus, it has to be run under bash, not dash, which is the default /bin/sh for scripts on Debian-like distributions.
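
If your bash is version 4.3 or newer, an alternative to the sleep-based polling is wait -n, which blocks until any one background job finishes. A minimal sketch of that variant, reusing the same names as above:

#!/bin/bash
# Sketch only: wait -n requires bash >= 4.3.
maxproc=3

for fname in *_R1.fastq.gz
do
    # Instead of polling with sleep, block until a job slot frees up.
    while [ "$(jobs -r | wc -l)" -ge "$maxproc" ]
    do
        wait -n
    done
    base=${fname%_R1*}
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
wait    # wait for any remaining jobs before exiting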

Upvotes: 4
