Reputation: 696
I am having trouble using awk
NR==FNR
to return lines of interest from an input .fastq file.
I have the following example input file called example.fastq
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
I am trying to extract groups of four lines that contain a string of interest, importantly approximate matches must be allowed hence the use of agrep instead of grep. The below example works.
agrep -1 -n "GAAATAATA" example.fastq | awk -F: 'NR==FNR{for(i=($1-1);i<=($1+2);i++)a[i];next}FNR in a' - example.fastq
The above command produces the following correct output.
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
However if I use a sequence not contained in the second line this command still prints the top two lines as in the following example.
agrep -1 -n "TAGATAAAACT" example.fastq | awk -F: 'NR==FNR{for(i=($1-1);i<=($1+2);i++)a[i];next}FNR in a' - example.fastq
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
Thanks for helping me understand the behavior of this awk command.
Upvotes: 0
Views: 609
Reputation: 785376
You may use this agrep + awk
solution:
srch() {
awk -F ': ' 'NR==FNR {
a[$1] = 1
next
}
a[FNR] {
print p
print
for (i=0; i<2 && getline > 0; i++)
print
}
{
p=$0
}' <(agrep -1 -n "$2" "$1") "$1"
}
Then run it as:
srch file 'GAAATAATA'
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
and this:
srch file 'TAGATAAAACT
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6'
Upvotes: 1
Reputation: 16176
There are no colons (:
) in your input, so $1
refers to the whole line and ($1-1)
& ($2+2)
are going to be -1
and 2
respectively, meaning your for
loop will always run exactly four times (for values of i
equal to -1
, 0
, 1
, then 2
).
Inside the for
loop, you're assuring that a[i]
exists (that's a[-1]
, a[0]
, a[1]
, and a[2]
).
The final part of your code prints the line being examined at that time (but not from the first file thanks to the next
in the prior stanza) whenever the array a
contains an entry for that file's line number. Therefore, it prints lines 1 and 2 from each input (since a[FNR]
exists for FNR
equal to 1 or 2).
Since you need an approximate answer and therefore must use agrep
, the idea proposed by James Brown's answer to your other question makes sense but its implementation (as dissected above) does not.
The following solution uses agrep
's hits as cues for surrounding lines to print alongside the hits (agrep
does not support context lines like grep
's -A NUM
and -B num
or else we'd be able to do agrep -A1 -B2 -1 -n PATTERN example.fastq
for a simpler answer).
agrep -1 "GAAATAATA" example.fastq | awk '
NR == FNR { agrep_hit[$0] = 1; next }
agrep_hit[$0] { print last_line; i = 1 }
0 < i && i < 4 { i++; print }
{ last_line = $0 }
' - example.fastq
This examines the input file twice. The first time uses agrep
to find the approximate pattern matches while the second uses awk
to get the requested lines of context.
When the overall line number in awk
(NR
) equals the local file's line number (FNR
), it means we're examining the first input (-
, standard input, which is the output of agrep
). We store the approximate pattern hits in an associative array for later and then move on to the next line with next
(therefore the rest of the awk
commands only operate on later inputs).
Since you need the previous line, we have to print it explicitly. The final stanza of awk
code saves the current line as last_line
so we can retrieve it later. Upon a line that was outputted by agrep
(and thus stored in our array), we print that saved last_line
and set an iterator i
to 1
.
When i
is 1
, 2
, or 3
, we increment it and print the current line. This prints the matching line and then two more for context.
Upvotes: 1
Reputation: 67497
with record separator definition (GNU awk
)
$ awk -v RS='(^|\n)@' '/GAAATAATA/{printf "%s", rt $0} {rt=RT}' file
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
Upvotes: 0