HKJ3
HKJ3

Reputation: 477

Error when using AWK Command for Filtering and Matching Data from Two Files

I have two files one which includes 3 columns, called Chromosome and Gene_start Gene_end. The first column contains a number and Gene_start Gene_end contains a numeric value.

>> cat gene_positions.txt
Chromosome Gene_start Gene_end 
15 26788693 27189686
1 22004791 22115099
20 44689624 44723591
20 61872136 61909046

I have another file with 14 columns, I am only intrested in the first two columns of this file. CHROM and POS:

>> cat gwas_results.txt
CHROM   POS ID  REF ALT A1  TEST    OBS_CT  BETA    SE  L95 U95 T_STAT  P
1   22004791    1:54712:T:TTTTC T   TTTTC   T   ADD 1597    -0.00256645 0.0313666   -0.0640439  0.058911    -0.0818212  0.934799
1   22105000    1:54712:T:TC    T   TTTTC   T   ADD 1597    -0.00259875 0.0313666   -0.0640439  0.058911    -0.0818216  0.94799
1   825410  rs13303179:825410:G:A   G   A   G   ADD 1597    -0.017462   0.0235123   -0.0635454  0.0286213   -0.742676   0.457788

If the first column from file 1 matches exactly to the first column of the second file as well as the second column of gwas_results.txt called POS falls within the range of the second and third column of file 1 (Gene_start Gene_end) for the line to printed in the output_test.txt. Essentially, the gene_positions is a lookup file, with three columns and the first is a number that should match the first column of the gwas_results.txt, next if the POS in gwas_results.txt falls within the range of the Gene_start Gene_end from the first file, I want the line from gwas_results.txt to be printed.

I am using the following awk command:

awk 'NR==FNR {chromosome[$1]; start[$1]; end[$1]; next} 
   ($1 in chromosome) && ($2 >= start[$1]) && ($2 <= end[$1]) {print}'\
 gene_positions.txt gwas_results.txt > test_output_results.txt

I get 0 output and 0 in the error file. What am I doing wrong.

Expected output:

1   22004791    1:54712:T:TTTTC T   TTTTC   T   ADD 1597    -0.00256645 0.0313666   -0.0640439  0.058911    -0.0818212  0.934799
1   22105000    1:54712:T:TC    T   TTTTC   T   ADD 1597    -0.00259875 0.0313666   -0.0640439  0.058911    -0.0818216  0.94799

BECAUSE... The first column of 'gwas_results.txt' ($1) matches a value in the first column of 'gene_positions.txt'. The second column of 'gwas_results.txt' ($2) falls within the range defined by the second and third columns of 'gene_positions.txt' (Gene_start and Gene_end) for chromosome 1.

Upvotes: 2

Views: 97

Answers (2)

anubhava
anubhava

Reputation: 785876

You can use this awk command:

awk '
NR == FNR {
   beg[$1] = $2
   end[$1] = $3
   next
}
$1 in beg && beg[$1] <= $2 && $2 <= end[$1]
' gene_positions.txt gwas_results.txt

1   22004791    1:54712:T:TTTTC T   TTTTC   T   ADD 1597    -0.00256645 0.0313666   -0.0640439  0.058911    -0.0818212  0.934799
1   22105000    1:54712:T:TC    T   TTTTC   T   ADD 1597    -0.00259875 0.0313666   -0.0640439  0.058911    -0.0818216  0.94799

If you want header as well then use:

awk '
NR == FNR {
   beg[$1] = $2
   end[$1] = $3
   next
}
FNR == 1 || ($1 in beg && beg[$1] <= $2 && $2 <= end[$1])
' gene_positions.txt gwas_results.txt

CHROM   POS ID  REF ALT A1  TEST    OBS_CT  BETA    SE  L95 U95 T_STAT  P
1   22004791    1:54712:T:TTTTC T   TTTTC   T   ADD 1597    -0.00256645 0.0313666   -0.0640439  0.058911    -0.0818212  0.934799
1   22105000    1:54712:T:TC    T   TTTTC   T   ADD 1597    -0.00259875 0.0313666   -0.0640439  0.058911    -0.0818216  0.94799

In your code you are using this: start[$1]; end[$1];

which makes start and end associative array with keys as $1 and value as null. You are not storing desired values $2 and $3 in these 2 arrays.

Moreover you don't even need chromosome array just to check for presence of $1 in that array since both start and end arrays will have keys as $1 only so you can either use $1 in start or $1 in end (or just omit this condition altogether like this:

beg[$1] <= $2 && $2 <= end[$1]

Upvotes: 2

markp-fuso
markp-fuso

Reputation: 35266

Addressing why OP's current code does not work ...

OP mentions in the description:

The second column of 'gwas_results.txt' ($2) falls within the range defined by the second and third columns of 'gene_positions.txt' (Gene_start and Gene_end) for chromosome 1.

While in their code we see:

NR==FNR {chromosome[$1]; start[$1]; end[$1]; next} 
                         ^^^^^^^^^  ^^^^^^^

Where start[$1] and end[$1] create array entries with $1 as the index and <nothing> as the value stored in the array entries, ie, the array entries are empty/blank so when we get to the test we're actually testing against empty/blank values:

($2 >= start[$1]) && ($2 <= end[$1])
       ^^^^^^^^^==empty     ^^^^^^^==empty

The solution is to assign values to the array entries, eg:

'NR==FNR {chromosome[$1]; start[$1]=$2; end[$1]=$3; next}
                                   ^^^         ^^^

Upvotes: 0

Related Questions