Reputation: 1546
I have two large data tables (~10M lines and ~4M lines) which I would like to match using an array indexed on [$1,$2]. Both fields are numeric only, as shown in this example from the head of the 4M-line file1, followed by the head of the 10M-line file2:
$ head -5 pantro2-hg19-liftover.frq
1 868476 A:0.388889
1 868841 A:0.666667
1 873398 A:0.555556
1 879624 A:0.05
1 879821 A:0.0625
$ head -5 tot_YRI10.frq
CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}
1 30923 2 20 T:0.35 G:0.65
1 52238 2 20 G:0.55 T:0.45
1 54676 2 20 T:0.05 C:0.95
1 55164 2 20 A:0.55 C:0.45
Unfortunately, it seems that AWK makes ambiguous matches if part of [$1,$2] matches $1,$2 in file2. When I use the following command, all 10M lines of file2 are returned:
$ awk 'NR==FNR{YRI[$1,$2];next} $1,$2 in YRI {print $1,$2,$NF}' \
    pantro2-hg19-liftover.frq tot_YRI10.frq |
    head -5
CHROM POS {ALLELE:FREQ}
1 30923 G:0.65
1 52238 T:0.45
1 54676 C:0.95
1 55164 C:0.45
My desired output is the lines of file2 that match file1 on columns 1 and 2. There should only be about 15K matches. I'm not sure what is ambiguous about array matching in this case.
Upvotes: 0
Views: 120
Reputation: 204184
You should be using $1,$2, not $1$2, as the array index.
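To see why the comma matters in the index, here is a minimal sketch on made-up input (the two pairs below are hypothetical, not taken from your files). With plain concatenation the pairs (1, 12) and (11, 2) both become the key 112, while the comma form joins the parts with awk's SUBSEP character, so they stay distinct:

```shell
# Hypothetical toy input: two different (chrom, pos) pairs.
pairs='1 12
11 2'

# Concatenated key: both rows collapse to the same key "112".
concat_keys=$(printf '%s\n' "$pairs" |
    awk '{ seen[$1 $2] } END { n = 0; for (k in seen) n++; print n }')

# Comma key: awk joins the parts with SUBSEP, so the rows stay distinct.
subsep_keys=$(printf '%s\n' "$pairs" |
    awk '{ seen[$1,$2] } END { n = 0; for (k in seen) n++; print n }')

echo "concatenated: $concat_keys distinct key(s)"
echo "comma-joined: $subsep_keys distinct key(s)"
```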
You used $1,$2 in YRI as the condition. Change that to ($1,$2) in YRI.
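As a rough check of the parenthesized condition, here is a sketch that runs it on two tiny stand-in files. The file names and rows are hypothetical, made up in the same layout as the samples in the question:

```shell
# Hypothetical miniature inputs in the same layout as the question's files.
tmpdir=$(mktemp -d)
printf '%s\n' '1 52238 A:0.388889' '1 99999 A:0.05' > "$tmpdir/file1.frq"
printf '%s\n' 'CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}' \
    '1 30923 2 20 T:0.35 G:0.65' \
    '1 52238 2 20 G:0.55 T:0.45' \
    '1 54676 2 20 T:0.05 C:0.95' > "$tmpdir/file2.frq"

# Parenthesized index: keep only rows of file2 whose (col1, col2) pair
# was stored as a key while reading file1.
matches=$(awk 'NR==FNR { YRI[$1,$2]; next } ($1,$2) in YRI { print $1, $2, $NF }' \
    "$tmpdir/file1.frq" "$tmpdir/file2.frq")

echo "$matches"
rm -rf "$tmpdir"
```

On this toy data only the 1 52238 row of the second file is printed, and the header line is dropped because it has no counterpart in the first file.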
x,y is the syntax for a range pattern: it selects every line from the one where x is true through the one where y is true (typically written like /start/,/end/), while (x,y) is the syntax for creating an array index for use with the in operator.
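The two forms can be compared side by side. This is only an illustrative sketch: the input lines and the array name arr are made up for the example:

```shell
# Hypothetical toy input.
lines='a
start
b
end
c'

# Range pattern: select from the line matching /start/ through the line
# matching /end/ (the default action prints each selected line).
range_out=$(printf '%s\n' "$lines" | awk '/start/,/end/')

# Parenthesized list with "in": test whether a compound key exists.
index_out=$(awk 'BEGIN {
    arr["x","y"]                                  # create the key x SUBSEP y
    if (("x","y") in arr)    print "found"
    if (!(("x","z") in arr)) print "missing"
}')

echo "$range_out"
echo "$index_out"
```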
When you write $1,$2 in YRI you are writing ($1),($2 in YRI), which tells awk to start printing from the first line where $1 is neither zero nor null (which it presumably is on the first line of your file) and to stop at the line where $2 in YRI is true (which it presumably never will be, since every key in YRI is a $1,$2 pair rather than $2 alone), so you print the whole file.
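If you want to convince yourself of that parse, the sketch below runs the original condition and the explicitly parenthesized range on the same hypothetical stand-in files (names and rows invented for the example) and compares the results:

```shell
# Hypothetical miniature inputs; file1 contains a pair that also occurs in file2.
tmpdir=$(mktemp -d)
printf '%s\n' '1 52238 A:0.388889' > "$tmpdir/file1.frq"
printf '%s\n' 'CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}' \
    '1 30923 2 20 T:0.35 G:0.65' \
    '1 52238 2 20 G:0.55 T:0.45' > "$tmpdir/file2.frq"

# The condition as written in the question.
unparenthesized=$(awk 'NR==FNR{YRI[$1,$2];next} $1,$2 in YRI {print $1,$2,$NF}' \
    "$tmpdir/file1.frq" "$tmpdir/file2.frq")

# The same condition with the range made explicit.
explicit_range=$(awk 'NR==FNR{YRI[$1,$2];next} ($1),($2 in YRI) {print $1,$2,$NF}' \
    "$tmpdir/file1.frq" "$tmpdir/file2.frq")

# Count how many lines of file2 were printed by the original condition.
printed=$(printf '%s\n' "$unparenthesized" | awk 'END { print NR }')

echo "$unparenthesized"
echo "lines printed: $printed"
rm -rf "$tmpdir"
```

Both commands print every line of the second toy file, header included, because the range opens on the first line and its end condition is never met.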
Upvotes: 4