antekkalafior

Reputation: 282

Looping over pairs of files

Hello, I need to iterate over pairs of files and run a command on each pair.

For example, I have 4 files named AA2234_1.fastq.gz, AA2234_2.fastq.gz, AA3945_1.fastq.gz and AA3945_2.fastq.gz.

As you can probably tell, the pairs are AA2234_1.fastq.gz <-> AA2234_2.fastq.gz and AA3945_1.fastq.gz <-> AA3945_2.fastq.gz (they share the part of the name before the _ sign).

The command I need to run has this syntax:

initialize_of_command file1 file2 output_a output_b output_c output_d parameteres

I want the script to count the files with the fastq.gz extension in a directory, divide that count by 2 to get the number of pairs, match the files into pairs (probably with a regex, perhaps into two variables), and run this command once for each pair.

I have no idea how to pair up the files using a regex, or how to iterate over the pairs so that the script knows which pairs it has already processed.

Here is my unfinished script:

#!/bin/bash
raw_count_of_files=$(ls | grep -c "fastq.gz")
count_of_files=$((raw_count_of_files / 2))

for ((i=1;i<=count_of_files;i++));
do
java -jar /home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar PE -phred33 AA2234_1.fastq.gz AA2234_2.fastq.gz AA2234_forward_paired.fq.gz AA2234_forward_unpaired.fq.gz AA2234_reverse_paired.fq.gz AA2234_reverse_unpaired.fq.gz SLIDINGWINDOW:4:20 MINLEN:20;
done

I would also like the output files to be named after the shared part of the input names, which in this case is AA2234 and AA3945.

The script should produce 8 files, named according to the pairs:

AA2234_forward_paired.fq.gz 
AA2234_forward_unpaired.fq.gz 
AA2234_reverse_paired.fq.gz 
AA2234_reverse_unpaired.fq.gz

and

AA3945_forward_paired.fq.gz 
AA3945_forward_unpaired.fq.gz 
AA3945_reverse_paired.fq.gz 
AA3945_reverse_unpaired.fq.gz

Upvotes: 1

Views: 1201

Answers (4)

M. Nejat Aydin

Reputation: 10123

If there are exactly two files for each prefix (the prefix being the part of the filename before the _), this can be done with a simple for loop, without resorting to arrays:

#!/bin/bash

jarfile='/home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar'

prefix=
for file in *_*.fastq.gz; do
    if [[ $prefix ]]; then
        echo java -jar "$jarfile" PE -phred33 \
            "$first" "$file" "$prefix"_{forward,reverse}_{,un}paired.fq.gz \
            'SLIDINGWINDOW:4:20' 'MINLEN:20'
        prefix=
    else
        first=$file
        prefix=${file%%_*}
    fi
done

Drop the echo if the command printed out looks good.
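
A variation on the same idea, in case a file may have lost its mate: loop only over the *_1.fastq.gz files and derive the expected *_2.fastq.gz name from each one. This is a minimal sketch, not part of the answer as posted; list_pairs is a hypothetical helper name, and it assumes the _1/_2 naming shown in the question.

```shell
#!/bin/bash

# Hypothetical helper: for every *_1.fastq.gz in directory $1 (default: .),
# print "prefix forward reverse" if the matching *_2.fastq.gz exists.
list_pairs() {
    dir=${1:-.}
    for fwd in "$dir"/*_1.fastq.gz; do
        [ -e "$fwd" ] || continue          # the pattern matched nothing
        stem=${fwd%_1.fastq.gz}            # path without the "_1.fastq.gz" suffix
        rev=${stem}_2.fastq.gz             # name the mate is expected to have
        if [ -e "$rev" ]; then
            printf '%s %s %s\n' "${stem##*/}" "$fwd" "$rev"
        else
            printf 'no mate for %s\n' "$fwd" >&2
        fi
    done
}
```

Each output line can then be read back with something like list_pairs | while read -r prefix f1 f2; do ...; done, which again assumes the filenames contain no whitespace.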

Upvotes: 0

Léa Gris

Reputation: 19545

One way to iterate over pairs of arguments:

#!/usr/bin/env sh

proc_fastq_pairs() {
  # loop while there are fastq files passed as argument
  while [ $# -gt 0 ]; do
    fq1=$1
    # consume 1 argument as file 1
    shift
    fq2=$1
    # consume 1 argument as file 2
    shift
    initialize_of_command "$fq1" "$fq2" output_a output_b output_c output_d parameteres
  done
}

initialize_of_command() {
  # dummy command to show passed arguments for debug purpose
  printf 'initialize_of_command %s\n' "$*"
}

# The expansion of the globbing pattern ./*.fastq.gz
# is always sorted (according to the locale's collation order).
# This keeps similarly named files
# together: fq1 fq2 ...
proc_fastq_pairs ./*.fastq.gz

Alternatively with xargs:

printf '%s\n' ./*.fastq.gz | xargs -L 2 bash -c 'initialize_of_command "$1" "$2" output_a output_b output_c output_d parameteres' _
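
To also get the output names the question asks for, the prefix can be derived inside the same shift-based loop. A sketch under assumptions: proc_fastq_pairs_named is a hypothetical name, initialize_of_command and parameteres are the placeholders from the question, and the command is only echoed as a dry run.

```shell
#!/usr/bin/env sh

# Hypothetical variant of proc_fastq_pairs: prints one dry-run command per
# pair, with the four output names built from the part of the first file
# name that precedes the first "_".
proc_fastq_pairs_named() {
  while [ "$#" -ge 2 ]; do
    fq1=$1
    fq2=$2
    shift 2                    # consume both files of the pair
    name=${fq1##*/}            # drop a leading directory such as ./
    prefix=${name%%_*}         # AA2234_1.fastq.gz -> AA2234
    echo initialize_of_command "$fq1" "$fq2" \
      "${prefix}_forward_paired.fq.gz" "${prefix}_forward_unpaired.fq.gz" \
      "${prefix}_reverse_paired.fq.gz" "${prefix}_reverse_unpaired.fq.gz" \
      parameteres
  done
  if [ "$#" -ne 0 ]; then      # odd number of files: one is left over
    echo "unpaired file left over: $1" >&2
    return 1
  fi
}
```

It would be called the same way as before, for example proc_fastq_pairs_named ./*.fastq.gz, and it returns a non-zero status if the number of files is odd.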

Upvotes: 2

konsolebox

Reputation: 75478

#!/bin/bash

declare -A assoc=()
shopt -s nullglob

for f in *_?.fastq.gz; do
    base=${f%_*}
    assoc[$base]=${assoc[$base]-}${assoc[$base]+ }$f
done

set -f

for pair in "${assoc[@]}"; do
    set -- $pair
    # TODO: Check $# and do something with $1 and $2
done
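
One possible way to fill in the TODO, as a sketch: handle_pair is a hypothetical helper that receives one space-joined value from the array, checks that it splits into exactly two names, and prints the Trimmomatic command from the question as a dry run. The jar path is the one given in the question, and the filenames are assumed to contain no whitespace.

```shell
#!/bin/bash

jarfile='/home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar'

# Hypothetical helper. The body runs in a subshell, so "set -f" and the
# changed positional parameters do not leak into the caller.
handle_pair() (
    set -f                  # disable globbing before the unquoted expansion
    set -- $1               # split "file1 file2" into $1 and $2
    if [ "$#" -ne 2 ]; then
        printf 'expected 2 files, got %s: %s\n' "$#" "$*" >&2
        exit 1              # leaves only the subshell, with status 1
    fi
    base=${1%_*}            # AA2234_1.fastq.gz -> AA2234
    echo java -jar "$jarfile" PE -phred33 "$1" "$2" \
        "${base}_forward_paired.fq.gz" "${base}_forward_unpaired.fq.gz" \
        "${base}_reverse_paired.fq.gz" "${base}_reverse_unpaired.fq.gz" \
        SLIDINGWINDOW:4:20 MINLEN:20
)
```

Inside the second loop it would be called as handle_pair "$pair"; drop the echo once the printed commands look right.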

Upvotes: 1

tshiono

Reputation: 22012

Assuming the filenames do not contain whitespace, would you please try:

#!/bin/bash

declare -A hash                         # associative array to tie basename with files
for f in *fastq.gz; do                  # search the files with the suffix
    base=${f%_*}                        # remove the last "_" and everything after it
    if [[ -z ${hash[$base]} ]]; then    # if nothing is stored for this base yet
        hash[$base]=$f                  # then store the filename
    else
        hash[$base]+=" $f"              # else append the filename delimited by the whitespace
    fi
done

for base in "${!hash[@]}"; do           # loop over the hash keys (basename)
    read -r f1 f2 <<< "${hash[$base]}"  # split into filenames

    echo java -jar /home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar PE -phred33 "$f1" "$f2" "$base"_forward_paired.fq.gz "$base"_forward_unpaired.fq.gz "$base"_reverse_paired.fq.gz "$base"_reverse_unpaired.fq.gz SLIDINGWINDOW:4:20 MINLEN:20;
done

The script outputs the java command lines as a dry run. If the output looks good, drop echo and run.

Upvotes: 2
