Reputation: 1446
I have a big file A
(tab delimited) with ~8 million of rows, the file looks like this:
1 10001
1 10005
1 10019
1 10055
1 10108
And another file B
that looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10001 . T C . . AF=0.0000384;AC=1;AN=26028;DS=SS6004475
1 10002 . A C,T . . AF=0.0001153,0.0000384;AC=3,1;AN=26028;DS=HGDP00927|HGDP00998|HGDP01284,HGDP01029
1 10002 . A AT . . AF=0.0000384;AC=1;AN=26028;DS=HGDP00521
1 10003 . A C,T . . AF=0.0000384,0.0000768;AC=1,2;AN=26028;DS=HGDP01284,HGDP00521|HGDP00927
1 10004 . C A . . AF=0.0000384;AC=1;AN=26028;DS=HGDP01284
1 10018 . C T . . AF=0.0000384;AC=1;AN=26028;DS=HGDP00998
1 10019 rs775809821 TA T . . AF=0.0000384;AC=1;AN=26028;END=10020;DS=Malay
1 10055 rs768019142 T TA . . AF=0.0000384;AC=1;AN=26028;DS=Malay
1 10108 rs62651026 C T . . AF=0.0000768;AC=2;AN=26028;DS=HGDP00778|HGDP0456
What I want is to match the columns in the file A
with the first two columns of file B
and add info from the 3rd column of file B
to file A
. If there is no match I would like to output a NA
instead. The desired output would be something like this. The B
file also have several commented lines with #
that I would like to avoid while reading the file. The output would be something like this:
1 10001 .
1 10005 NA
1 10019 rs775809821
1 10055 rs768019142
1 10108 rs62651026
Thanks
Upvotes: 2
Views: 1056
Reputation: 9622
Does your 1st column only contain numerical values? Would these contain more than 1-2 digits? Would you expect numbers greater than 5-9 digits in the 2nd column? Are both files numerically sorted (i.e. sort -k1n,1 -k2n,2
)? Is the combination of columns 1 and 2 unique in each file (or do they contain duplicates)? If you can answer yes to all of the above questions, then here is another solution.
join -a1 -o 1.2,1.3,2.2 -e "NA" \
<(awk '{printf("%02d:%09d\t%s\t%s\n", $1, $2, $1, $2)}' A) \
<(cat B | grep -v ^# | awk '{printf("%02d:%09d\t%s\n", $1, $2, $3)}')
If file B
is gzipped, then just use zcat B
or gunzip -c B
instead of cat B
.
Upvotes: 0
Reputation: 26667
You can write something like
awk 'NR==FNR{line[$1" "$2]=$3; next} ($0 in line){print $0,line[$0]; next} {print $0, "NA"}' file2 file1
Example
$ awk 'NR==FNR{line[$1" "$2]=$3; next} ($0 in line){print $0,line[$0]; next} {print $0, "NA"}' file2 file1
1 10001 .
1 10005 NA
1 10019 rs775809821
1 10055 rs768019142
1 10108 rs62651026
What it does?
NR==FNR{line[$1" "$2]=$3; next}
If the input file is first file ( which is taken care by NR==FNR
), then we save the $3
in array line
indexed by $1" "$2
.
($0 in line){print $0,line[$0]; next}
For the second file, if the current line is in line
then print that line, followed by value in the file2
.
{print $0, "NA"}
If not print the line followed by NA
.
Upvotes: 3