Reputation: 586
The following command calculates the GC content for each of the fastq files identified with the find command. Briefly, a fastq file stores a large number of datapoints as records of 4 lines each; the second line of each record, the only one I'm interested in, contains only the characters A, T, G and C. For testing, (identical) example files can be found here.
find . -iname '*.fastq' -exec awk '(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}' "{}" \;
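For reference, a minimal 4-line fastq record looks like this (made-up read name and quality line); the awk above only inspects line 2 of each record, and for a file containing just this record it would print 0.5 (6 G/C characters out of 12):

@read1
ACGTGCATGCAT
+
IIIIIIIIIIII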
How can I modify/rewrite it into a one-liner that works on gzipped fastq files? I need to keep the pattern-matching option currently used with find.
Upvotes: 0
Views: 181
Reputation: 14452
The find option -exec can be used to invoke (and pass arguments to) a single program. The challenge here is that two commands (zcat | awk) need to be combined with a pipe. There are two possible paths: construct a shell command, or use the more flexible xargs.
# Using 'sh -c' to run the pipeline
find . -iname '*.fastq.gz' -exec sh -c "zcat {} | awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}'" \;
# OR, using process substitution
find . -iname '*.fastq.gz' -exec bash -c "awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}' <(zcat {})" \;
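For completeness, the xargs path mentioned above could look like the sketch below; it uses the same inline awk and shares the sh -c form's caveat about filenames containing spaces or other shell metacharacters:

# A sketch using xargs instead of -exec
find . -iname '*.fastq.gz' -print0 | xargs -0 -I{} sh -c "zcat {} | awk '(NR%4==2) \
{N1+=length(\$0);gsub(/[AT]/,\"\");N2+=length(\$0);}END{print N2/N1;}'"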
See the many references to find/xargs on Stack Overflow.
Upvotes: 1
Reputation: 207465
If, as you say, you have many large files, I would suggest processing them in parallel. If the issue is that you are having problems quoting your awk, I would suggest putting your script in a separate file called, say, script.awk, like this:
(NR%4==2) {N1+=length($0);gsub(/[AT]/,"");N2+=length($0);}END{print N2/N1;}
Now you can simply process them all in parallel with GNU Parallel:
find . -iname \*fastq.gz -print0 | parallel -0 gzcat {} \| awk -f ./script.awk
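As a quick sanity check before launching the parallel run, you could test the script on a single file (hypothetical sample.fastq.gz here):

zcat sample.fastq.gz | awk -f ./script.awk    # gzcat on BSD/macOS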
Upvotes: 1