Reputation: 137
General question: I am trying to look up a name based on a reference table using an index. The values of the first columns in my document and reference table need to match, but I then want to check if the numerical value in my file falls between two numerical columns in my reference table. If so, I want to print the corresponding name from the reference table into a new column in my file. Otherwise, I would like to print "NA"
Specific example I want to match the first columns (Chr) of my file and the reference table to see if they match. If so, I want to see if the value in column 2 of the File (pos) is greater than column 3 of the reference (Start) but less than column 4 of the reference (End). If so, I want to add the corresponding col 4 (Gene_name) from the reference table to a new 4th column of my file. If there is no corresponding gene, I want to put "NA" in the gene name column of my file.
File
Chr pos p-val
2L 1885826 2.638e-08
2L 1996567 5.12e-05
2L 2360597 9.472e-05
2R 2360621 9.472e-05
2R 2360623 9.472e-05
Reference
Chr Start End Gene_name
2L 1884260 1888828 FBgn0262029
2L 19531851 19547482 FBgn0052532
2L 2350523 2361570 FBgn0023536
2L 4647871 4648646 FBgn0029718
Desired output
Chr pos p-val Gene_name
2L 1885826 2.638e-08 FBgn0262029
2L 1996567 5.12e-05 NA
2L 2360597 9.472e-05 FBgn0023536
2R 2360621 9.472e-05 NA
2R 2360623 9.472e-05 NA
Attempt
I have been trying to adjust the code from this question to do this. awk Lookup 2 files, print match and Sum of Sencond Field:
So far, I have this:
awk -F"\t" ' #Set Field separator to "\t"
FNR==NR {a[$1];a[$2];a[$3];next} #Read data from Reference using field #1, field #2, field #3 as index in to array a
{if ($1 in a) #Test if field #1 in Table is found in a
{if ($2 > a[$2])
{if ($2 < a[$3])
print $0, a[$4] #If found, print line of f1.txt with sum and index from array
else print $0,"NA" #If not found print line of f1.txt with NotFound
}
else print $0,"NA" #If not found print line of f1.txt with NotFound
}
else print $0,"NA" #If not found print line of f1.txt with NotFound
}
' OFS="\t" Referance.txt File.txt #Set Output field separator to , and read files
which produces
2L 1885826 2.638e-08 NA
2L 1996567 5.12e-05 NA
2L 2360597 9.472e-05 NA
2R 2360621 9.472e-05 NA
2R 2360623 9.472e-05 NA
I think I am making a mistake somewhere in the nested if else statements, but I am not sure what I am doing wrong. Would appreciate any suggestions!
Upvotes: 0
Views: 40
Reputation: 67467
awk
to the rescue!
awk '{k=$1}
NR==FNR {c[k]++; start[k,c[k]]=$2; end[k,c[k]]=$3; gene[k,c[k]]=$4; next}
{$(NF+1)=FNR==1?"Gene_name":"NA"}
k in c {for(i=1;i<=c[k];i++)
if(start[k,i]<=$2 && $2<=end[k,i])
{$NF=gene[k,i];
break}}1' file2 file1 | column -t
Chr pos p-val Gene_name
2L 1885826 2.638e-08 FBgn0262029
2L 1996567 5.12e-05 NA
2L 2360597 9.472e-05 FBgn0023536
2R 2360621 9.472e-05 NA
2R 2360623 9.472e-05 NA
some remarks: assumes the second file fits in memory and ranges don't overlap (if so, only the first match is reported). Does a linear scan so might be slow if either file is large, a better algorithm may start with a sorted ranges (will also validate whether they overlap or not).
Code should be easy to read, essentially loads up the full file2 into multiple arrays and checks the entries of the first file for matching ranges.
Upvotes: 1