awk for many compressed files

The following command calculates the GC content of each fastq file found by the find command. Briefly, a fastq file stores a large number of data points as records of 4 lines each; I am only interested in the second line of each record, which contains only the characters A, T, G and C. For testing, (identical) example files can be found here.

find . -iname '*.fastq' -exec awk '(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}' "{}" \;
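To see what the awk program computes, here is a quick check against a hypothetical two-record fastq file (the file name sample.fastq and its contents are made up for illustration). N1 counts all bases on the sequence lines; after gsub deletes A and T, N2 counts only G and C, so N2/N1 is the GC fraction:

```shell
# Two made-up records; sequence lines are "ATGC" and "GGCC"
printf '@r1\nATGC\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > sample.fastq

# N1 = 8 total bases, N2 = 6 bases left after deleting A/T -> prints 0.75
awk '(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}' sample.fastq
```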

How can I modify/rewrite it into a one-liner that works on gzipped fastq files? I need to keep the case-insensitive -iname matching currently used with find.

Upvotes: 0

Views: 181

Answers (2)

dash-o

Reputation: 14452

find's '-exec' can invoke (and pass arguments to) a single program. The challenge here is that two commands (zcat|awk) need to be combined with a pipe. There are two possible paths: construct a shell command, OR use the more flexible xargs.

# Using 'sh -c'; passing the filename as "$1" (instead of substituting {}
# into the command string) keeps filenames with spaces or quotes safe
find . -iname '*.fastq.gz' -exec sh -c \
  'zcat "$1" | awk "(NR%4==2){N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}"' sh {} \;

# OR, using bash process substitution
find . -iname '*.fastq.gz' -exec bash -c \
  'awk "(NR%4==2){N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}" <(zcat "$1")' bash {} \;
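The answer also mentions xargs as the more flexible route; one possible sketch (assuming GNU find/xargs and zcat are available) spawns one shell per file and again passes the filename as "$1":

```shell
# Hypothetical xargs variant: -n 1 runs one shell per file, and "$1"
# keeps odd filenames (spaces, quotes) safe. Assumes GNU find/xargs.
find . -iname '*.fastq.gz' -print0 |
  xargs -0 -n 1 sh -c \
    'zcat "$1" | awk "(NR%4==2){N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}"' sh
```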

There are many references to find/xargs combinations on Stack Overflow.

Upvotes: 1

Mark Setchell

Reputation: 207465

If, as you say, you have many large files, I would suggest processing them in parallel. If the issue is that you are having problems quoting your awk, I would suggest putting your script in a separate file, called, say, script.awk, like this:

(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}

Now you can simply process them all in parallel with GNU Parallel:

find . -iname \*fastq.gz -print0 | parallel -0 gzcat {} \| awk -f ./script.awk
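If GNU Parallel is not available, the same script.awk can be driven sequentially with find alone; a sketch assuming GNU zcat (gzcat is the macOS/BSD name used above):

```shell
# Sequential fallback: decompress each file and feed it to the shared
# awk script; "$1" carries the filename safely into the inner shell
find . -iname '*.fastq.gz' -exec sh -c 'zcat "$1" | awk -f ./script.awk' sh {} \;
```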

Upvotes: 1
