user964689

Reputation: 822

Awk matching between two files when regions intersect (any solutions welcome)

This builds on an earlier question, "Awk conditional filter one file based on another (or other solutions)".

Quick summary at bottom of question

I have an awk program that outputs a column from rows in a text file 'refGene.txt' if values in that row match 2 out of 3 values in another text file.

I need to include an additional criterion for finding a match between the two files: a row should be included if the range given by the 2 numerical values in a row of file 1 overlaps with the range given by the two values in a row of refGene.txt. Example lines in file 1:

chr1 10 20
chr2 10 20

and an example line in file 2 (refGene.txt), showing only the matching columns ($3, $5, $6):

chr1 5 30

Currently the awk program does not treat this as a match because, although the first column matches, neither the 2nd nor the 3rd column does. I would like this to count as a match because the region 10-20 in file 1 is WITHIN the range 5-30 in refGene.txt. The second line in file 1 should NOT match, because its first column does not match, and that match is required. If there is a way to include every case where any part of the range in file 1 overlaps any part of the range in refGene.txt, that would be really helpful (so a partial overlap also counts as a match). This should also replace the conditional statements below, since it would find all the cases they currently cover.

In summary, I want awk to print a match if $1 in file 1 matches $3 in file 2 AND the range $2-$3 in file 1 intersects at all with the range $5-$6 in file 2.

Please let me know if my question is unclear. Any help is really appreciated, thanks in advance! (Solutions do not have to be in awk.)

Rubal

FILES=/files/*txt   
for f in $FILES ;
do

    awk '
        BEGIN {
            FS = "\t";
        }
        FILENAME == ARGV[1] {
            pair[ $1, $2, $3 ] = 1;
            next;
        }
        {
            if ( pair[ $3, $5, $6 ] == 1 ) {
                print $13;
            }
        }
    ' "$f" /files/refGene.txt > "/files/results/$(basename "$f")"
done

Upvotes: 2

Views: 1611

Answers (1)

glenn jackman

Reputation: 247102

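You just need to use 2 arrays, one for the start and one for the end of the region in file 1 (this keeps only one region per chromosome, so a later line for the same chromosome overwrites an earlier one):

awk -F '\t' '
  # First file: remember the start and end of the region for each chromosome
  NR == FNR {min[$1] = $2; max[$1] = $3; next}
  # Second file: print $13 when the chromosome matches and the two ranges overlap at all
  ($3 in min) && (min[$3] <= $6) && (max[$3] >= $5) {print $13}
' "$f" /files/refGene.txt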

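NR == FNR is just another way to write FILENAME == ARGV[1]: it looks at line numbers instead of filenames. NR counts lines across all input files and FNR restarts at 1 for each file, so the two are equal only while the first file is being read.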
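If file 1 can hold several regions for the same chromosome, one array entry per chromosome is not enough. The following is only a sketch of one way to handle that, assuming tab-separated input laid out as in the question. The sample data, the temporary file names (regions.txt, ref.txt) and the array names n, lo and hi are made up for illustration. It stores every region under a per-chromosome counter and tests each one with the usual interval-overlap condition (start1 <= end2 and end1 >= start2).

```sh
#!/bin/sh
# Hypothetical sample data: only columns 3, 5, 6 and 13 of the second file matter.
tmp=$(mktemp -d)
printf 'chr1\t10\t20\nchr2\t10\t20\n' > "$tmp/regions.txt"
# printf reuses the format for each group of five arguments (one line per gene).
printf '0\t%s\t%s\t+\t%s\t%s\t0\t0\t1\t0,\t0,\t0\t%s\n' \
    NM_1 chr1 5 30 GENE_A \
    NM_2 chr1 15 40 GENE_B \
    NM_3 chr1 25 50 GENE_C \
    NM_4 chr3 10 20 GENE_D > "$tmp/ref.txt"

result=$(awk -F '\t' '
  # First file: store every region, numbered per chromosome.
  NR == FNR { n[$1]++; lo[$1, n[$1]] = $2 + 0; hi[$1, n[$1]] = $3 + 0; next }
  # Second file: print $13 once if any stored region on the same chromosome overlaps $5-$6.
  {
    for (i = 1; i <= n[$3]; i++)
      if (lo[$3, i] <= $6 && hi[$3, i] >= $5) { print $13; break }
  }
' "$tmp/regions.txt" "$tmp/ref.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this sample data, GENE_A (region fully inside 5-30) and GENE_B (partial overlap with 15-40) are printed; GENE_C (25-50, no overlap) and GENE_D (chromosome not in file 1) are not. The `+ 0` forces numeric comparison, so 5 compares below 10 rather than being compared as strings.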

Upvotes: 0
