awk for many compressed files

The following command calculates the GC content of each fastq file found by the find command. Briefly, a fastq file stores a large number of data points as records of 4 lines each; I am only interested in the second line of each record, which contains only the characters A, T, G and C. For testing, (identical) example files can be found here.

find . -iname '*.fastq' -exec awk '(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}' "{}" \;
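To see what the awk program computes, here is a quick check against a hypothetical two-record fastq file (the file name sample.fastq and its contents are made up for illustration). N1 counts all bases on the sequence lines; after gsub deletes A and T, N2 counts only G and C, so N2/N1 is the GC fraction:

```shell
# Two made-up records; sequence lines are "ATGC" and "GGCC"
printf '@r1\nATGC\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > sample.fastq

# N1 = 8 total bases, N2 = 6 bases left after deleting A/T -> prints 0.75
awk '(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}' sample.fastq
```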

How can I modify/rewrite it into a one-liner that works on gzipped fastq files? I need to keep the case-insensitive -iname matching currently used with find.

Upvotes: 0

Views: 181

Answers (2)

dash-o

Reputation: 14452

find's '-exec' can invoke (and pass arguments to) a single program. The challenge here is that two commands (zcat|awk) need to be combined with a pipe. There are two possible paths: construct a shell command, OR use the more flexible xargs.

# Using 'sh -c'; passing the filename as "$1" (instead of substituting {}
# into the command string) keeps filenames with spaces or quotes safe
find . -iname '*.fastq.gz' -exec sh -c \
  'zcat "$1" | awk "(NR%4==2){N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}"' sh {} \;

# OR, using bash process substitution
find . -iname '*.fastq.gz' -exec bash -c \
  'awk "(NR%4==2){N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}" <(zcat "$1")' bash {} \;
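The answer also mentions xargs as the more flexible route; one possible sketch (assuming GNU find/xargs and zcat are available) spawns one shell per file and again passes the filename as "$1":

```shell
# Hypothetical xargs variant: -n 1 runs one shell per file, and "$1"
# keeps odd filenames (spaces, quotes) safe. Assumes GNU find/xargs.
find . -iname '*.fastq.gz' -print0 |
  xargs -0 -n 1 sh -c \
    'zcat "$1" | awk "(NR%4==2){N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}"' sh
```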

There are many references to find/xargs combinations on Stack Overflow.

Upvotes: 1

Mark Setchell

Reputation: 207465

If, as you say, you have many large files, I would suggest processing them in parallel. If the issue is that you are having problems quoting your awk, I would suggest putting your script in a separate file, called, say, script.awk, like this:

(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}

Now you can simply process them all in parallel with GNU Parallel:

find . -iname \*fastq.gz -print0 | parallel -0 gzcat {} \| awk -f ./script.awk
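If GNU Parallel is not available, the same script.awk can be driven sequentially with find alone; a sketch assuming GNU zcat (gzcat is the macOS/BSD name used above):

```shell
# Sequential fallback: decompress each file and feed it to the shared
# awk script; "$1" carries the filename safely into the inner shell
find . -iname '*.fastq.gz' -exec sh -c 'zcat "$1" | awk -f ./script.awk' sh {} \;
```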

Upvotes: 1
