user4670961
user4670961

Reputation: 137

Lookup by multiple columns

General question: I am trying to look up a name based on a reference table using an index. The values of the first columns in my document and reference table need to match, but I then want to check if the numerical value in my file falls between two numerical columns in my reference table. If so, I want to print the corresponding name from the reference table into a new column in my file. Otherwise, I would like to print "NA"

Specific example I want to match the first columns (Chr) of my file and the reference table to see if they match. If so, I want to see if the value in column 2 of the File (pos) is greater than column 3 of the reference (Start) but less than column 4 of the reference (End). If so, I want to add the corresponding col 4 (Gene_name) from the reference table to a new 4th column of my file. If there is no corresponding gene, I want to put "NA" in the gene name column of my file.

File

Chr pos     p-val
2L  1885826 2.638e-08
2L  1996567 5.12e-05
2L  2360597 9.472e-05
2R  2360621 9.472e-05
2R  2360623 9.472e-05

Reference

Chr Start    End        Gene_name
2L  1884260  1888828    FBgn0262029
2L  19531851 19547482   FBgn0052532
2L  2350523  2361570    FBgn0023536
2L  4647871  4648646    FBgn0029718

Desired output

Chr pos     p-val      Gene_name
2L  1885826 2.638e-08  FBgn0262029
2L  1996567 5.12e-05   NA
2L  2360597 9.472e-05  FBgn0023536
2R  2360621 9.472e-05  NA
2R  2360623 9.472e-05  NA

Attempt

I have been trying to adjust the code from this question to do this. awk Lookup 2 files, print match and Sum of Sencond Field:

So far, I have this:

awk -F"\t" '                                       #Set Field separator to     "\t"
FNR==NR {a[$1];a[$2];a[$3];next}                    #Read data from Reference using field #1, field #2, field #3 as index in to array a
{if ($1 in a)                               #Test if field #1 in Table is found in a
{if ($2 > a[$2])
{if ($2 < a[$3])
print $0, a[$4]                       #If found, print line of f1.txt with sum and index from array
else print $0,"NA"     #If not found print line of f1.txt with NotFound
}
else print $0,"NA"     #If not found print line of f1.txt with NotFound
}
else print $0,"NA"     #If not found print line of f1.txt with NotFound
}
' OFS="\t" Referance.txt  File.txt                       #Set Output field separator to , and read files

which produces

2L  1885826 2.638e-08   NA
2L  1996567 5.12e-05    NA
2L  2360597 9.472e-05   NA
2R  2360621 9.472e-05   NA
2R  2360623 9.472e-05   NA

I think I am making a mistake somewhere in the nested if else statements, but I am not sure what I am doing wrong. Would appreciate any suggestions!

Upvotes: 0

Views: 40

Answers (1)

karakfa
karakfa

Reputation: 67467

awk to the rescue!

awk      '{k=$1} 
  NR==FNR {c[k]++; start[k,c[k]]=$2; end[k,c[k]]=$3; gene[k,c[k]]=$4; next} 
          {$(NF+1)=FNR==1?"Gene_name":"NA"} 
  k in c  {for(i=1;i<=c[k];i++) 
              if(start[k,i]<=$2 && $2<=end[k,i]) 
                  {$NF=gene[k,i]; 
                   break}}1' file2 file1 | column -t


Chr  pos      p-val      Gene_name
2L   1885826  2.638e-08  FBgn0262029
2L   1996567  5.12e-05   NA
2L   2360597  9.472e-05  FBgn0023536
2R   2360621  9.472e-05  NA
2R   2360623  9.472e-05  NA

some remarks: assumes the second file fits in memory and ranges don't overlap (if so, only the first match is reported). Does a linear scan so might be slow if either file is large, a better algorithm may start with a sorted ranges (will also validate whether they overlap or not).

Code should be easy to read, essentially loads up the full file2 into multiple arrays and checks the entries of the first file for matching ranges.

Upvotes: 1

Related Questions