Reputation: 822
This builds on an earlier question: Awk conditional filter one file based on another (or other solutions).
Quick summary at bottom of question
I have an awk program that outputs a column from rows in a text file, refGene.txt, if values in that row match 2 out of 3 values in another text file.
I need to include an additional criterion for finding a match between the two files: a row should be included if the range given by the two numerical values in a row of file 1 overlaps the range given by the two values in a row of refGene.txt. Two example lines from file 1:
chr1 10 20
chr2 10 20
and an example line from file 2 (refGene.txt), showing only the matching columns ($3, $5, $6):
chr1 5 30
Currently the awk program does not treat this as a match because, although the first column matches, neither the 2nd nor the 3rd column does. I would like to treat this as a match because the region 10-20 in file 1 is WITHIN the range 5-30 in refGene.txt. The second line in file 1 should NOT match, because the first column does not match and that is a requirement. It would be really helpful if any overlap between a range in file 1 and a range in refGene.txt counted as a match (so partial overlap counts too). This should also replace the conditional statements below, since it would find all the cases they currently cover.
So, in summary: I want awk to print a match if $1 in file 1 matches $3 in file 2 AND the range $2-$3 in file 1 intersects at all with the range $5-$6 in file 2.
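The overlap condition in the summary can be stated as one comparison. Assuming both ranges are closed and written start-then-end, they intersect exactly when each range starts no later than the other one ends. A minimal sketch in awk, where the names overlap_demo and overlaps are made up for illustration:

```shell
# overlap_demo is a hypothetical wrapper so the check can be run on its own.
overlap_demo() {
    awk '
    # overlaps(a, b, c, d): 1 if closed ranges [a, b] and [c, d] share a point.
    function overlaps(a, b, c, d) { return (a <= d) && (c <= b) }
    BEGIN {
        print overlaps(10, 20, 5, 30)    # 10-20 lies inside 5-30
        print overlaps(10, 20, 15, 40)   # partial overlap
        print overlaps(10, 20, 21, 30)   # disjoint ranges
    }'
}
overlap_demo
```

The first two calls print 1 and the third prints 0, so full containment and partial overlap are both covered by the same test.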
Please let me know if my question is unclear. Any help is really appreciated, thanks in advance! (Solutions do not have to be in awk.)
Rubal
FILES=/files/*txt
for f in $FILES
do
    awk '
    BEGIN {
        FS = "\t"
    }
    # First file: remember each (chromosome, start, end) triple.
    FILENAME == ARGV[1] {
        pair[$1, $2, $3] = 1
        next
    }
    # refGene.txt: print column 13 only on an exact match of all three values.
    {
        if (pair[$3, $5, $6] == 1) {
            print $13
        }
    }
    ' "$f" /files/refGene.txt > "/files/results/$(basename "$f")"
done
Upvotes: 2
Views: 1611
Reputation: 247102
You just need to use 2 arrays:
awk -F '\t' '
NR == FNR {min[$1] = $2; max[$1] = $3; next}
($3 in min) && (min[$3] <= $6) && (max[$3] >= $5) {print $13}
' file1 refGene.txt
NR == FNR is just another way to write FILENAME == ARGV[1]: it looks at line numbers instead of filenames. The two behave the same unless the first file is empty.
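One caveat with min[$1] and max[$1]: they hold a single range per chromosome, so a later line in file 1 with the same first column overwrites an earlier one. Below is a sketch that keeps every range and prints a row when the two ranges overlap at all. It assumes tab-separated input, start <= end on every line, and the column layout from the question ($1-$3 in file 1; $3, $5, $6 and $13 in refGene.txt). The function name find_overlaps and the array names n, lo and hi are made up for illustration.

```shell
# find_overlaps RANGES REFGENE (hypothetical helper): prints column 13 of each
# REFGENE row whose $3 matches a RANGES row and whose $5-$6 overlaps its range.
find_overlaps() {
    awk -F '\t' '
    # First file: store every range per chromosome, not only the last one.
    NR == FNR {
        n[$1]++
        lo[$1, n[$1]] = $2 + 0
        hi[$1, n[$1]] = $3 + 0
        next
    }
    # Second file: print column 13 once if any stored range overlaps $5-$6.
    $3 in n {
        for (i = 1; i <= n[$3]; i++) {
            if (lo[$3, i] <= $6 + 0 && hi[$3, i] >= $5 + 0) {
                print $13
                break
            }
        }
    }' "$1" "$2"
}
```

Inside the loop from the question it could be called as find_overlaps "$f" /files/refGene.txt. The break keeps a refGene.txt row from being printed more than once when several ranges in file 1 overlap it; drop it if one output line per overlapping pair is wanted.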
Upvotes: 0