Reputation: 1546
I am writing to ask for an explanation of some elements of this short AWK command, which I am using to print lines from test-file_long.txt whose fields match fields in test-file_short.txt. The code works fine; I would just like to know exactly what the program is doing, since I am very new to programming and would like to be able to think on my feet for future commands that I will need to write. Here is the example:
$ cat test-file_long.txt
2 41647 41647 A G
2 45895 45895 A G
2 45953 45953 T C
2 224919 224919 A G
2 230055 230055 C G
2 233239 233239 A G
2 234130 234130 T G
$ cat test-file_short.txt
2 41647 41647 A G
2 45895 45895 A G
2 FALSE 224919 A G
2 233239 233239 A G
2 234130 234130 T G
$ awk 'NR==FNR{a[$2];next}$2 in a{print $0,FNR}' test-file_short.txt test-file_long.txt
2 41647 41647 A G 1
2 45895 45895 A G 2
2 233239 233239 A G 6
2 234130 234130 T G 7
It is a very simple matching problem for which I found the commands on this site a few weeks ago. My questions are:
1) What exactly does NR==FNR do? I know that the two variables stand for the total number of records read so far and the number of records read from the current input file, respectively, but why is this necessary for the code to operate? When I remove this from the command, the result is the same as paste test-file_long.txt test-file_short.txt.
2) For $2 in a, does AWK automatically read field 2 from file 2 as part of the syntax here?
3) I just want to confirm that ;next just means to skip all other blocks and go to the next line? So in other words, the code first performs a[$2] for every line and then goes back and performs the other blocks for each line? When I remove ;next I still get the filtered output, but only after a full printout of test-file_short.txt.
Thanks for any and all input, my goal is just to understand better how AWK works, since it has been extraordinarily useful for my current work (processing large genomics datasets).
Upvotes: 1
Views: 257
Reputation: 40778
Here is some information related to your code:
NR==FNR is true only while the first file is being read, because FNR starts again from 1 for file number 2, whereas NR continues to increase.
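To see the two counters side by side, here is a minimal sketch. The file names first.txt and second.txt and their contents are made up for illustration; they are not the files from the question.

```shell
# Hypothetical sample files (2 lines and 3 lines) in a temporary directory.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/first.txt"
printf 'c\nd\ne\n' > "$tmp/second.txt"

# For every line of both files, print the file name, both counters,
# and whether NR==FNR holds on that line.
counters=$(cd "$tmp" && awk '{
    print FILENAME, "NR=" NR, "FNR=" FNR, (NR == FNR ? "equal" : "differ")
}' first.txt second.txt)

printf '%s\n' "$counters"
rm -rf "$tmp"
```

The lines from first.txt should all report "equal", and the lines from second.txt should report "differ", since NR keeps counting from 3 while FNR restarts at 1.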
$2 in a is only evaluated for file number 2, because of the next statement inside the first rule. next stops processing the current line and moves on to the next one, so the second rule is never reached for file number 1.
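Putting the two rules together, the one-liner from the question can be spread out and commented as below. This is only a sketch for reading purposes: the temporary directory and the shortened names short.txt and long.txt are my own choices, and the data is copied from the question.

```shell
# Recreate the two input files from the question in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/short.txt" <<'EOF'
2 41647 41647 A G
2 45895 45895 A G
2 FALSE 224919 A G
2 233239 233239 A G
2 234130 234130 T G
EOF
cat > "$tmp/long.txt" <<'EOF'
2 41647 41647 A G
2 45895 45895 A G
2 45953 45953 T C
2 224919 224919 A G
2 230055 230055 C G
2 233239 233239 A G
2 234130 234130 T G
EOF

matches=$(awk '
    # Rule 1: true only for the first file (short.txt).
    # Referencing a[$2] creates the key $2 in array a; no value is needed.
    # next skips rule 2 and moves on to the following input line.
    NR == FNR { a[$2]; next }

    # Rule 2: only reached for the second file (long.txt).
    # $2 is field 2 of the current line; print the line and its
    # line number within long.txt if that field was stored as a key.
    $2 in a { print $0, FNR }
' "$tmp/short.txt" "$tmp/long.txt")

printf '%s\n' "$matches"
rm -rf "$tmp"
```

This also answers question 2: $2 always refers to the line currently being read, and it happens to come from the second file only because next keeps the first file's lines away from that rule.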
Upvotes: 2