map two columns in a file to another file using awk

Question

I have a big file A (tab delimited) with ~8 million of rows, the file looks like this:

And another file B that looks like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   10001   .   T   C   .   .   AF=0.0000384;AC=1;AN=26028;DS=SS6004475
1   10002   .   A   C,T .   .   AF=0.0001153,0.0000384;AC=3,1;AN=26028;DS=HGDP00927|HGDP00998|HGDP01284,HGDP01029
1   10002   .   A   AT  .   .   AF=0.0000384;AC=1;AN=26028;DS=HGDP00521
1   10003   .   A   C,T .   .   AF=0.0000384,0.0000768;AC=1,2;AN=26028;DS=HGDP01284,HGDP00521|HGDP00927
1   10004   .   C   A   .   .   AF=0.0000384;AC=1;AN=26028;DS=HGDP01284
1   10018   .   C   T   .   .   AF=0.0000384;AC=1;AN=26028;DS=HGDP00998
1   10019   rs775809821 TA  T   .   .   AF=0.0000384;AC=1;AN=26028;END=10020;DS=Malay
1   10055   rs768019142 T   TA  .   .   AF=0.0000384;AC=1;AN=26028;DS=Malay 
1   10108   rs62651026  C   T   .   .   AF=0.0000768;AC=2;AN=26028;DS=HGDP00778|HGDP0456

What I want is to match the columns in the file A with the first two columns of file B and add info from the 3rd column of file B to file A. If there is no match I would like to output a NA instead. The desired output would be something like this. The B file also have several commented lines with # that I would like to avoid while reading the file. The output would be something like this:

1 10001 .
1 10005 NA
1 10019 rs775809821
1 10055 rs768019142
1 10108 rs62651026

Thanks

nu11p01n73R · Accepted Answer

You can write something like

awk 'NR==FNR{line[$1" "$2]=$3; next} ($0 in line){print $0,line[$0]; next} {print $0, "NA"}' file2 file1

Example

$ awk 'NR==FNR{line[$1" "$2]=$3; next} ($0 in line){print $0,line[$0]; next} {print $0, "NA"}' file2 file1
1 10001 .
1 10005 NA
1 10019 rs775809821
1 10055 rs768019142
1 10108 rs62651026

What it does?

NR==FNR{line[$1" "$2]=$3; next} If the input file is first file ( which is taken care by NR==FNR ), then we save the $3 in array line indexed by $1" "$2.
($0 in line){print $0,line[$0]; next} For the second file, if the current line is in line then print that line, followed by value in the file2.
{print $0, "NA"} If not print the line followed by NA.

map two columns in a file to another file using awk

Answers (2)

Related Questions