How can I make this AWK array matching unambiguous?

Question

I have two large data tables (~4M lines and ~10M lines) that I would like to match on the key [$1,$2] using an AWK array. Both fields are purely numeric, as shown in this example from the head of the 4M-line file1 followed by the head of the 10M-line file2:

$ head -5 pantro2-hg19-liftover.frq 
1   868476  A:0.388889
1   868841  A:0.666667
1   873398  A:0.555556
1   879624  A:0.05
1   879821  A:0.0625
$ head -5 tot_YRI10.frq 
CHROM   POS N_ALLELES   N_CHR   {ALLELE:FREQ}
1   30923   2   20  T:0.35  G:0.65
1   52238   2   20  G:0.55  T:0.45
1   54676   2   20  T:0.05  C:0.95
1   55164   2   20  A:0.55  C:0.45

Unfortunately, it seems that AWK makes ambiguous matches if only part of [$1,$2] from file1 matches $1,$2 in file2. When I use the following command, all 10M lines of file2 are returned:

$ awk 'NR==FNR{YRI[$1,$2];next} $1,$2 in YRI {print $1,$2,$NF}' \
    pantro2-hg19-liftover.frq tot_YRI10.frq | head -5
CHROM POS {ALLELE:FREQ}
1 30923 G:0.65
1 52238 T:0.45
1 54676 C:0.95
1 55164 C:0.45

My desired output is the lines of file2 that match file1 on columns 1 and 2. There should only be about 15K such matches. I'm not sure what about the array matching is ambiguous in this case.

Ed Morton · Accepted Answer

Using $1,$2 rather than $1$2 as the array index is correct: the comma joins the two fields with SUBSEP, so keys such as 1,23 and 12,3 stay distinct.

You used $1,$2 in YRI as the condition. Change that to ($1,$2) in YRI.
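As a minimal sketch of that fix, the script below runs the corrected condition against two tiny made-up files. The file names file1.frq and file2.frq and their contents are hypothetical stand-ins for the real data in the question, not the real files.

```shell
# Hypothetical miniature stand-ins for the two .frq files in the question.
tmp=$(mktemp -d)
printf '1\t868476\tA:0.388889\n1\t879624\tA:0.05\n' > "$tmp/file1.frq"
printf 'CHROM\tPOS\tN_ALLELES\tN_CHR\t{ALLELE:FREQ}\n' > "$tmp/file2.frq"
printf '1\t30923\t2\t20\tT:0.35\tG:0.65\n' >> "$tmp/file2.frq"
printf '1\t879624\t2\t20\tT:0.05\tC:0.95\n' >> "$tmp/file2.frq"

# The parentheses make ($1,$2) a two-part array subscript tested with "in",
# instead of the start and end of a range pattern.
matches=$(awk 'NR==FNR{YRI[$1,$2];next} ($1,$2) in YRI {print $1,$2,$NF}' \
    "$tmp/file1.frq" "$tmp/file2.frq")
echo "$matches"
rm -r "$tmp"
```

With this sample data only the one row of the second file whose first two columns also appear in the first file is printed.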

x,y is the syntax for a range pattern: it selects every line from one where x is true through the next one where y is true (typically written like /start/,/end/), while (x,y) is the syntax for creating an array index for use with the in operator.
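The two syntaxes can be compared side by side on toy input. This is only an illustrative sketch; the input lines and the array name seen are made up for the example.

```shell
# Range pattern: selects from the line matching /start/ through the line matching /end/.
range_out=$(printf 'a\nstart\nb\nend\nc\n' | awk '/start/,/end/')
echo "$range_out"

# Parenthesized index: tests whether the ($1,$2) pair is a key of the array.
in_out=$(printf '1 2\n3 4\n' | awk 'BEGIN{seen[1,2]} ($1,$2) in seen')
echo "$in_out"
```

The first command selects a block of consecutive lines, while the second selects individual lines by key lookup.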

When you write $1,$2 in YRI, awk parses it as ($1),($2 in YRI). That tells awk to start printing at the first line where $1 is neither zero nor empty (which it presumably is on the first line of your file) and to stop at the line where $2 in YRI is true (which it presumably never will be, since every key in YRI is built from both fields), so you print the whole file.
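One way to check this parse is to run the original condition and the explicitly parenthesized range on the same input and compare the results. The sketch below uses hypothetical miniature files (small.frq and big.frq) rather than the real data.

```shell
# Hypothetical miniature inputs; no row of big.frq matches small.frq.
tmp=$(mktemp -d)
printf '1 868476 A:0.388889\n' > "$tmp/small.frq"
printf 'CHROM POS FREQ\n1 30923 G:0.65\n1 52238 T:0.45\n' > "$tmp/big.frq"

# The condition as written in the question.
implicit=$(awk 'NR==FNR{YRI[$1,$2];next} $1,$2 in YRI {print $1,$2,$NF}' \
    "$tmp/small.frq" "$tmp/big.frq")

# The same condition with the range pattern spelled out.
explicit=$(awk 'NR==FNR{YRI[$1,$2];next} ($1),($2 in YRI) {print $1,$2,$NF}' \
    "$tmp/small.frq" "$tmp/big.frq")

echo "$implicit"
[ "$implicit" = "$explicit" ] && echo "same output"
rm -r "$tmp"
```

Both commands print every line of the second file even though none of its rows match, because the range opens on the header line and its end condition never becomes true.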
