Paul
Paul

Reputation: 696

awk NR==FNR for command syntax

I am having trouble using awk NR==FNR to return lines of interest from an input .fastq file.

I have the following example input file called example.fastq

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

I am trying to extract groups of four lines that contain a string of interest, importantly approximate matches must be allowed hence the use of agrep instead of grep. The below example works.

agrep -1 -n "GAAATAATA" example.fastq | awk -F: 'NR==FNR{for(i=($1-1);i<=($1+2);i++)a[i];next}FNR in a' - example.fastq

The above command produces the following correct output.

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

However if I use a sequence not contained in the second line this command still prints the top two lines as in the following example.

agrep -1 -n "TAGATAAAACT" example.fastq | awk -F: 'NR==FNR{for(i=($1-1);i<=($1+2);i++)a[i];next}FNR in a' - example.fastq

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

Thanks for helping me understand the behavior of this awk command.

Upvotes: 0

Views: 609

Answers (3)

anubhava
anubhava

Reputation: 785376

You may use this agrep + awk solution:

srch() {
   awk -F ': ' 'NR==FNR {
      a[$1] = 1
      next
   }
   a[FNR] {
      print p
      print
      for (i=0; i<2 && getline > 0; i++)
         print
   }
   {
      p=$0
   }' <(agrep -1 -n "$2" "$1") "$1"
}

Then run it as:

srch file 'GAAATAATA'

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

and this:

srch file 'TAGATAAAACT

@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6'

Upvotes: 1

Adam Katz
Adam Katz

Reputation: 16176

There are no colons (:) in your input, so $1 refers to the whole line and ($1-1) & ($2+2) are going to be -1 and 2 respectively, meaning your for loop will always run exactly four times (for values of i equal to -1, 0, 1, then 2).

Inside the for loop, you're assuring that a[i] exists (that's a[-1], a[0], a[1], and a[2]).

The final part of your code prints the line being examined at that time (but not from the first file thanks to the next in the prior stanza) whenever the array a contains an entry for that file's line number. Therefore, it prints lines 1 and 2 from each input (since a[FNR] exists for FNR equal to 1 or 2).

Since you need an approximate answer and therefore must use agrep, the idea proposed by James Brown's answer to your other question makes sense but its implementation (as dissected above) does not.

The following solution uses agrep's hits as cues for surrounding lines to print alongside the hits (agrep does not support context lines like grep's -A NUM and -B num or else we'd be able to do agrep -A1 -B2 -1 -n PATTERN example.fastq for a simpler answer).

agrep -1 "GAAATAATA" example.fastq | awk '
  NR == FNR { agrep_hit[$0] = 1; next }
  agrep_hit[$0] { print last_line; i = 1 }       
  0 < i && i < 4 { i++; print } 
  { last_line = $0 }
' - example.fastq

This examines the input file twice. The first time uses agrep to find the approximate pattern matches while the second uses awk to get the requested lines of context.

When the overall line number in awk (NR) equals the local file's line number (FNR), it means we're examining the first input (-, standard input, which is the output of agrep). We store the approximate pattern hits in an associative array for later and then move on to the next line with next (therefore the rest of the awk commands only operate on later inputs).

Since you need the previous line, we have to print it explicitly. The final stanza of awk code saves the current line as last_line so we can retrieve it later. Upon a line that was outputted by agrep (and thus stored in our array), we print that saved last_line and set an iterator i to 1.

When i is 1, 2, or 3, we increment it and print the current line. This prints the matching line and then two more for context.

Upvotes: 1

karakfa
karakfa

Reputation: 67497

with record separator definition (GNU awk)

$ awk -v RS='(^|\n)@' '/GAAATAATA/{printf "%s", rt $0} {rt=RT}' file

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

Upvotes: 0

Related Questions