Reputation: 477
I have posted previosuly but I have not resolved the issue. I have two files one which includes 3 columns, called Chromosome and Gene_start Gene_end. There are approx 100 lines in this file. The first column contains a number (1-23) and Gene_start Gene_end contains a numeric value. If I filter for 4 in the first column or chromosome 4, my file looks like this:
>> cat gene_positions.txt
Chromosome Gene_start Gene_end
4 25121627 25167204
4 170981373 171017850
4 37592422 37692998
4 84382094 84411290
I have another file with 14 columns, I am only interested in the first two columns of this file. CHROM and POS:
>> cat gwas_results.txt
CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
1 22004791 1:54712:T:TTTTC T TTTTC T ADD 1597 -0.00256645 0.0313666 -0.0640439 0.058911 -0.0818212 0.934799
1 22105000 1:54712:T:TC T TTTTC T ADD 1597 -0.00259875 0.0313666 -0.0640439 0.058911 -0.0818216 0.94799
1 825410 rs13303179:825410:G:A G A G ADD 1597 -0.017462 0.0235123 -0.0635454 0.0286213 -0.742676 0.457788
If the first column from file 1 matches exactly to the first column of the second file as well as the second column of gwas_results.txt called POS falls within the range of the second and third column of file 1 (Gene_start Gene_end) for that particular line, then to print the line in the lookup_output.txt. Essentially, the gene_positions is a lookup file, with three columns and the first is a number that should match the first column of the gwas_results.txt, next if the POS in gwas_results.txt falls within the range of the Gene_start Gene_end from the first file, I want the line from gwas_results.txt to be printed.
I am using the following awk command as was suggested before:
#!/bin/bash
#SBATCH --job-name=genelookup
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16gb
#SBATCH --time=01:00:00
#SBATCH --export=NONE
cd /data/genome/hj/GWAS_results
awk '
NR == FNR {
start[$1] = $2
end[$1] = $3
next
}
$1 in start && $2 >= start[$1] && $3 <= end[$1]
' gene_positions.txt gwas_results.txt > lookup_output.txt
What I get are some positions outside the range in my lookup_output.txt
4 84529777 4:84529777:ACGTG:A ACGTG A ACGTG ADD 1635 0.00986994 0.0253137 -0.039744 0.0594839 0.389905 0.696658
4 85092377 4:84392377:G:T G T T ADD 1635 -0.00336762 0.0415539 -0.0848117 0.0780765 -0.0810422 0.935418
For example 84529777 is outside the range from the gene_positions file (4 84382094 84411290). I need the lookup to be line to line basis. E.g. scans the file for chromosome 4 and positions within 84382094 84411290 range. If found print.
I also get
4 37700998 4:84392377:G:T G T A ADD 1635 -0.0033 0.047539 -0.0847 0.0780765 -0.08 0.93
But this is outside the range (4 37592422 37692998) from the gene positions.
I really should be seeing
4 25121630 4:84392377:G:T G T T ADD 1635 -0.00336762 0.0415539 -0.0848117 0.0780765 -0.0810422 0.935418
4 25131627 4:84392378:A:T A T T ADD 1635 -0.00336762 0.0415539 -0.0848117 0.0780765 -0.0810422 0.935418
4 170981374 4:84415536:CA:C CA C C ADD 1635 -0.0359081 0.0576552 -0.14891 0.0770941 -0.622807 0.533499
4 171017750 4:84415882:AAACATG:A AAACATG A A ADD 1635 -0.00552875 0.0522986 -0.108032 0.0969745 -0.105715 0.915821
4 37492998 4:84454092:T:TTATATATG T TTATATATG TTATATATG ADD 1635 0.0148902 0.0442058 -0.0717516 0.101532 0.336838 0.736283
4 37592998 4:84458776:G:GTA G GTA GTA ADD 1635 -0.000157512 0.0252463
etc...
BECAUSE...they second column called POS is all within the range of the positions in the gene_positions.txt file, providing it is chromosome 4. In other words, the first column of 'gwas_results.txt' ($1) matches a value in the first column of 'gene_positions.txt' and if this is true, then the second column of 'gwas_results.txt' ($2) falls within the range defined by the second and third columns of 'gene_positions.txt' (Gene_start and Gene_end) for the particular chromosome.
Upvotes: 0
Views: 37