Reputation: 282
Hello, I need to iterate over pairs of files and do something with each pair.
For example, I have 4 files named:
AA2234_1.fastq.gz
AA2234_2.fastq.gz
AA3945_1.fastq.gz
AA3945_2.fastq.gz
As you can probably tell, the pairs are AA2234_1.fastq.gz <-> AA2234_2.fastq.gz and AA3945_1.fastq.gz <-> AA3945_2.fastq.gz (they share the part of the name before the _ sign).
I have a command whose syntax looks like this:
initialize_of_command file1 file2 output_a output_b output_c output_d parameteres
I want the script to count the files with the fastq.gz extension in a directory, divide that count by 2 to get the number of pairs, match the pairs together (probably using a regex, maybe into two variables), and execute this command once for each pair.
I have no idea how to pair up those files using a regex, or how to iterate over the pairs so that the script knows which pairs it has already processed.
Here is my unfinished script:
#!/bin/bash
raw_count_of_files=$(ls | grep -c "fastq.gz")
count_of_files=$((raw_count_of_files / 2))
for ((i=1;i<=count_of_files;i++));
do
java -jar /home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar PE -phred33 AA2234_1.fastq.gz AA2234_2.fastq.gz AA2234_forward_paired.fq.gz AA2234_forward_unpaired.fq.gz AA2234_reverse_paired.fq.gz AA2234_reverse_unpaired.fq.gz SLIDINGWINDOW:4:20 MINLEN:20;
done
Also, I would like the output files to be named after the shared prefix of the input files, which in this case is AA2234 and AA3945.
The desired output of this script is 8 files, named according to the pairs:
AA2234_forward_paired.fq.gz
AA2234_forward_unpaired.fq.gz
AA2234_reverse_paired.fq.gz
AA2234_reverse_unpaired.fq.gz
and
AA3945_forward_paired.fq.gz
AA3945_forward_unpaired.fq.gz
AA3945_reverse_paired.fq.gz
AA3945_reverse_unpaired.fq.gz
Upvotes: 1
Views: 1201
Reputation: 10123
If there are exactly two files for each prefix (the prefix being the portion of the filename before the _), then this job can be done with a simple for loop, without resorting to arrays:
#!/bin/bash
jarfile='/home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar'
prefix=
for file in *_*.fastq.gz; do
    if [[ $prefix ]]; then
        # second file of the pair: print the command, then reset
        echo java -jar "$jarfile" PE -phred33 \
            "$first" "$file" "$prefix"_{forward,reverse}_{,un}paired.fq.gz \
            'SLIDINGWINDOW:4:20' 'MINLEN:20'
        prefix=
    else
        # first file of the pair: remember it and its prefix
        first=$file
        prefix=${file%%_*}
    fi
done
Drop the echo once the printed command looks good.
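If you would rather not rely on the two files of a pair sitting next to each other in the glob result, a different technique is to loop over the _1 files only and build the _2 name from each one. This is a minimal sketch, assuming the _1/_2 naming from the question; trim_cmd is a hypothetical stand-in that only prints its arguments, and pair_by_name is just an illustrative name.

```sh
#!/bin/sh
# Sketch only: pair reads by deriving the _2 file name from each _1 file name.

# trim_cmd is a hypothetical stand-in that just prints its arguments;
# swap in the real java -jar ... call once the output looks right.
trim_cmd() {
    printf '%s\n' "$*"
}

pair_by_name() {
    for r1 in *_1.fastq.gz; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        prefix=${r1%_1.fastq.gz}          # e.g. AA2234
        r2=${prefix}_2.fastq.gz
        if [ ! -e "$r2" ]; then
            printf 'skipping %s: mate %s not found\n' "$r1" "$r2" >&2
            continue
        fi
        trim_cmd PE -phred33 "$r1" "$r2" \
            "${prefix}_forward_paired.fq.gz" "${prefix}_forward_unpaired.fq.gz" \
            "${prefix}_reverse_paired.fq.gz" "${prefix}_reverse_unpaired.fq.gz" \
            SLIDINGWINDOW:4:20 MINLEN:20
    done
}

pair_by_name
```

A file without its mate is reported on stderr and skipped instead of being silently paired with the next file.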
Upvotes: 0
Reputation: 19545
One way to iterate over pairs of arguments:
#!/usr/bin/env sh
proc_fastq_pairs() {
    # loop while there are fastq files left in the argument list
    while [ $# -gt 0 ]; do
        fq1=$1
        # consume 1 argument as file 1
        shift
        fq2=$1
        # consume 1 argument as file 2
        shift
        initialize_of_command "$fq1" "$fq2" output_a output_b output_c output_d parameteres
    done
}
initialize_of_command() {
    # dummy command that shows the passed arguments, for debugging
    printf 'initialize_of_command %s\n' "$*"
}
# The expansion of the globbing pattern ./*.fastq.gz
# is sorted according to the current locale's collation order,
# so similarly named files are kept
# together: fq1 fq2 ...
proc_fastq_pairs ./*.fastq.gz
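Alternatively, with xargs (here initialize_of_command has to be a real executable, because the dummy shell function above is not visible inside bash -c):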
printf '%s\n' ./*.fastq.gz | xargs -L 2 bash -c 'initialize_of_command "$1" "$2" output_a output_b output_c output_d parameteres' _
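Both variants trust that the arguments really arrive in pairs. As a hedged sketch, the shift-based loop can be given two sanity checks: one for a leftover unpaired file and one for two neighbours that do not share a prefix. The names proc_fastq_pairs_checked and run_pair are hypothetical; run_pair only prints what it receives.

```sh
#!/usr/bin/env sh
# Sketch: the same shift-based loop, with two sanity checks added.

# run_pair is a hypothetical stand-in for the real command.
run_pair() {
    printf 'pair: %s %s\n' "$1" "$2"
}

proc_fastq_pairs_checked() {
    while [ "$#" -gt 0 ]; do
        if [ "$#" -lt 2 ]; then
            printf 'unpaired file left over: %s\n' "$1" >&2
            return 1
        fi
        fq1=$1
        fq2=$2
        shift 2
        # both names must share the part before the last "_"
        if [ "${fq1%_*}" != "${fq2%_*}" ]; then
            printf 'prefix mismatch: %s vs %s\n' "$fq1" "$fq2" >&2
            return 1
        fi
        run_pair "$fq1" "$fq2"
    done
}
```

Call it the same way as before, for example proc_fastq_pairs_checked ./*.fastq.gz, and check its exit status before trusting the results.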
Upvotes: 2
Reputation: 75478
#!/bin/bash
declare -A assoc=()
shopt -s nullglob
for f in *_?.fastq.gz; do
    base=${f%_*}
    assoc[$base]=${assoc[$base]-}${assoc[$base]+ }$f
done
set -f
for pair in "${assoc[@]}"; do
    set -- $pair
    # TODO: Check $# and do something with $1 and $2
done
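One possible way to fill in the TODO, as a sketch rather than the only option: a helper that takes one space-separated value from the array, checks that it holds exactly two names, and prints the Trimmomatic command from the question as a dry run. run_pair is a hypothetical name, and the jar path is the one given in the question.

```sh
# Sketch for the TODO above: validate one array value and print the command.
# run_pair is a hypothetical helper, not part of the script above.
run_pair() {
    set -f              # no globbing while the value is split
    set -- $1           # split on whitespace into $1, $2, ...
    set +f
    if [ "$#" -ne 2 ]; then
        printf 'expected 2 files, got %d: %s\n' "$#" "$*" >&2
        return 1
    fi
    base=${1%_*}        # shared prefix, e.g. AA2234
    # echo makes this a dry run; remove it to execute the command
    echo java -jar /home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar \
        PE -phred33 "$1" "$2" \
        "${base}_forward_paired.fq.gz" "${base}_forward_unpaired.fq.gz" \
        "${base}_reverse_paired.fq.gz" "${base}_reverse_unpaired.fq.gz" \
        SLIDINGWINDOW:4:20 MINLEN:20
}
```

Inside the second loop, run_pair "$pair" would then take the place of the set -- $pair line and the TODO. Like the script itself, this assumes file names without whitespace.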
Upvotes: 1
Reputation: 22012
Assuming the filenames do not contain whitespace, would you please try:
#!/bin/bash
declare -A hash                         # associative array to tie basename with files
for f in *fastq.gz; do                  # search the files with the suffix
    base=${f%_*}                        # strip the last "_" and everything after it
    if [[ -z ${hash[$base]} ]]; then    # if nothing is stored for this basename yet
        hash[$base]=$f                  # then store the filename
    else
        hash[$base]+=" $f"              # else append the filename, delimited by whitespace
    fi
done
for base in "${!hash[@]}"; do           # loop over the hash keys (basename)
    read -r f1 f2 <<< "${hash[$base]}"  # split into filenames
    echo java -jar /home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar PE -phred33 "$f1" "$f2" "$base"_forward_paired.fq.gz "$base"_forward_unpaired.fq.gz "$base"_reverse_paired.fq.gz "$base"_reverse_unpaired.fq.gz SLIDINGWINDOW:4:20 MINLEN:20;
done
The script prints the java command lines as a dry run. If the output looks good, drop echo and run it.
Upvotes: 2