Reputation: 477
I have two files one which includes 3 columns, called Chromosome and Gene_start Gene_end. The first column contains a number and Gene_start Gene_end contains a numeric value.
>> cat gene_positions.txt
Chromosome Gene_start Gene_end
15 26788693 27189686
1 22004791 22115099
20 44689624 44723591
20 61872136 61909046
I have another file with 14 columns, I am only intrested in the first two columns of this file. CHROM and POS:
>> cat gwas_results.txt
CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
1 22004791 1:54712:T:TTTTC T TTTTC T ADD 1597 -0.00256645 0.0313666 -0.0640439 0.058911 -0.0818212 0.934799
1 22105000 1:54712:T:TC T TTTTC T ADD 1597 -0.00259875 0.0313666 -0.0640439 0.058911 -0.0818216 0.94799
1 825410 rs13303179:825410:G:A G A G ADD 1597 -0.017462 0.0235123 -0.0635454 0.0286213 -0.742676 0.457788
If the first column from file 1 matches exactly to the first column of the second file as well as the second column of gwas_results.txt called POS falls within the range of the second and third column of file 1 (Gene_start Gene_end) for the line to printed in the output_test.txt. Essentially, the gene_positions is a lookup file, with three columns and the first is a number that should match the first column of the gwas_results.txt, next if the POS in gwas_results.txt falls within the range of the Gene_start Gene_end from the first file, I want the line from gwas_results.txt to be printed.
I am using the following awk command:
awk 'NR==FNR {chromosome[$1]; start[$1]; end[$1]; next}
($1 in chromosome) && ($2 >= start[$1]) && ($2 <= end[$1]) {print}'\
gene_positions.txt gwas_results.txt > test_output_results.txt
I get 0 output and 0 in the error file. What am I doing wrong.
Expected output:
1 22004791 1:54712:T:TTTTC T TTTTC T ADD 1597 -0.00256645 0.0313666 -0.0640439 0.058911 -0.0818212 0.934799
1 22105000 1:54712:T:TC T TTTTC T ADD 1597 -0.00259875 0.0313666 -0.0640439 0.058911 -0.0818216 0.94799
BECAUSE... The first column of 'gwas_results.txt' ($1) matches a value in the first column of 'gene_positions.txt'. The second column of 'gwas_results.txt' ($2) falls within the range defined by the second and third columns of 'gene_positions.txt' (Gene_start and Gene_end) for chromosome 1.
Upvotes: 2
Views: 97
Reputation: 785876
You can use this awk
command:
awk '
NR == FNR {
beg[$1] = $2
end[$1] = $3
next
}
$1 in beg && beg[$1] <= $2 && $2 <= end[$1]
' gene_positions.txt gwas_results.txt
1 22004791 1:54712:T:TTTTC T TTTTC T ADD 1597 -0.00256645 0.0313666 -0.0640439 0.058911 -0.0818212 0.934799
1 22105000 1:54712:T:TC T TTTTC T ADD 1597 -0.00259875 0.0313666 -0.0640439 0.058911 -0.0818216 0.94799
If you want header as well then use:
awk '
NR == FNR {
beg[$1] = $2
end[$1] = $3
next
}
FNR == 1 || ($1 in beg && beg[$1] <= $2 && $2 <= end[$1])
' gene_positions.txt gwas_results.txt
CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
1 22004791 1:54712:T:TTTTC T TTTTC T ADD 1597 -0.00256645 0.0313666 -0.0640439 0.058911 -0.0818212 0.934799
1 22105000 1:54712:T:TC T TTTTC T ADD 1597 -0.00259875 0.0313666 -0.0640439 0.058911 -0.0818216 0.94799
In your code you are using this: start[$1]; end[$1];
which makes start
and end
associative array with keys as $1
and value as null. You are not storing desired values $2
and $3
in these 2 arrays.
Moreover you don't even need chromosome
array just to check for presence of $1
in that array since both start
and end
arrays will have keys as $1
only so you can either use $1 in start
or $1 in end
(or just omit this condition altogether like this:
beg[$1] <= $2 && $2 <= end[$1]
Upvotes: 2
Reputation: 35266
Addressing why OP's current code does not work ...
OP mentions in the description:
The second column of 'gwas_results.txt' ($2) falls within the range defined by the second and third columns of 'gene_positions.txt' (Gene_start and Gene_end) for chromosome 1.
While in their code we see:
NR==FNR {chromosome[$1]; start[$1]; end[$1]; next}
^^^^^^^^^ ^^^^^^^
Where start[$1]
and end[$1]
create array entries with $1
as the index and <nothing>
as the value stored in the array entries, ie, the array entries are empty/blank so when we get to the test we're actually testing against empty/blank values:
($2 >= start[$1]) && ($2 <= end[$1])
^^^^^^^^^==empty ^^^^^^^==empty
The solution is to assign values to the array entries, eg:
'NR==FNR {chromosome[$1]; start[$1]=$2; end[$1]=$3; next}
^^^ ^^^
Upvotes: 0