user2380782
user2380782

Reputation: 1446

map two columns in a file to another file using awk

I have a big file A (tab delimited) with ~8 million of rows, the file looks like this:

1 10001
1 10005
1 10019
1 10055
1 10108

And another file B that looks like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   10001   .   T   C   .   .   AF=0.0000384;AC=1;AN=26028;DS=SS6004475
1   10002   .   A   C,T .   .   AF=0.0001153,0.0000384;AC=3,1;AN=26028;DS=HGDP00927|HGDP00998|HGDP01284,HGDP01029
1   10002   .   A   AT  .   .   AF=0.0000384;AC=1;AN=26028;DS=HGDP00521
1   10003   .   A   C,T .   .   AF=0.0000384,0.0000768;AC=1,2;AN=26028;DS=HGDP01284,HGDP00521|HGDP00927
1   10004   .   C   A   .   .   AF=0.0000384;AC=1;AN=26028;DS=HGDP01284
1   10018   .   C   T   .   .   AF=0.0000384;AC=1;AN=26028;DS=HGDP00998
1   10019   rs775809821 TA  T   .   .   AF=0.0000384;AC=1;AN=26028;END=10020;DS=Malay
1   10055   rs768019142 T   TA  .   .   AF=0.0000384;AC=1;AN=26028;DS=Malay 
1   10108   rs62651026  C   T   .   .   AF=0.0000768;AC=2;AN=26028;DS=HGDP00778|HGDP0456

What I want is to match the columns in the file A with the first two columns of file B and add info from the 3rd column of file B to file A. If there is no match I would like to output a NA instead. The desired output would be something like this. The B file also have several commented lines with # that I would like to avoid while reading the file. The output would be something like this:

1 10001 .
1 10005 NA
1 10019 rs775809821
1 10055 rs768019142
1 10108 rs62651026

Thanks

Upvotes: 2

Views: 1056

Answers (2)

tommy.carstensen
tommy.carstensen

Reputation: 9622

Does your 1st column only contain numerical values? Would these contain more than 1-2 digits? Would you expect numbers greater than 5-9 digits in the 2nd column? Are both files numerically sorted (i.e. sort -k1n,1 -k2n,2)? Is the combination of columns 1 and 2 unique in each file (or do they contain duplicates)? If you can answer yes to all of the above questions, then here is another solution.

join -a1 -o 1.2,1.3,2.2 -e "NA" \
<(awk '{printf("%02d:%09d\t%s\t%s\n", $1, $2, $1, $2)}' A) \
<(cat B | grep -v ^# | awk '{printf("%02d:%09d\t%s\n", $1, $2, $3)}')

If file B is gzipped, then just use zcat B or gunzip -c B instead of cat B.

Upvotes: 0

nu11p01n73R
nu11p01n73R

Reputation: 26667

You can write something like

awk 'NR==FNR{line[$1" "$2]=$3; next} ($0 in line){print $0,line[$0]; next} {print $0, "NA"}' file2 file1

Example

$ awk 'NR==FNR{line[$1" "$2]=$3; next} ($0 in line){print $0,line[$0]; next} {print $0, "NA"}' file2 file1
1 10001 .
1 10005 NA
1 10019 rs775809821
1 10055 rs768019142
1 10108 rs62651026

What it does?

  • NR==FNR{line[$1" "$2]=$3; next} If the input file is first file ( which is taken care by NR==FNR ), then we save the $3 in array line indexed by $1" "$2.

  • ($0 in line){print $0,line[$0]; next} For the second file, if the current line is in line then print that line, followed by value in the file2.

  • {print $0, "NA"} If not print the line followed by NA.

Upvotes: 3

Related Questions