How to check whether a number range from one file is a subset of a number range from another file?

Question

I'm trying to find out whether each range in range1 [columns a and b] is a subset of, i.e. lies within, a range in range2 [columns b and c].

range1

a        b
15       20
 8       10
37       44
32       37

range2

    a       b       c
    chr1    6       12
    chr2    13      21
    chr3    31      35
    chr4    36      45

output:

a       b       c
chr1    6       12       8       10
chr2    13      21       15      20
chr4    36      45       37      44

I want to compare range1[a] with range2[b] and range1[b] with range2[c], as a one-to-all comparison.

For example, in the first run the first row of range1 is compared with all rows of range2. But range1[a] should be compared only with range2[b] and, similarly, range1[b] only with range2[c]. Based on this I wrote the following criterion:

lbsf1[j] >= lbs[i] && lbsf1[j] <= ubs[i] && ubsf1[j] >= lbs[i] && ubsf1[j] <= ubs[i]

r1[a]  r2[b]    r1[b]  r2[c]    result
15  >  6        20  <  12       False
15  >  13       20  <  21       True
15  >  31       20  <  35       False
15  >  36       20  <  45       False

I tried to learn from this code [which works when checking whether a single number lies in a specific range] and modified it to handle both numbers. But it did not work; I suspect I'm not reading the second file properly.

Code [from the reference, slightly modified]:

#!/bin/bash

awk -F'	' '
# 1st pass (fileB): read the lower and upper range bounds
FNR==NR { lbs[++count] = $2+0; ubs[count] = $3+0; next }
# 2nd pass (fileA): check each line against all ranges.
{ lbsf1[++countf1] = $1+0; ubsf1[countf1] = $2+0;

        for(i=1;i<=count;++i)
                {
                        for(j=1;j<=countf1;++j)
                        {
                        if (lbsf1[j] >= lbs[i] && lbsf1[j] <= ubs[i] && ubsf1[j] >= lbs[i] && ubsf1[j] <= ubs[i])
                                { print lbs[i]"	"ubs[i]"	"lbsf1[j]"	"ubsf1[j] ; next }
                        }
                }
}
' range2 range1

This code gave me output:

6       12      8       10
6       12      8       10
6       12      8       10

Thank you.

markp-fuso · Accepted Answer

Assumptions:

input files have neither a b nor a b c as the first line (we can modify the proposed code if these header lines really do exist in the data)
lines in range2 do not have leading white space (as shown in the provided sample)
while not demonstrated by the small sample provided, we'll assume that a row from range1 may 'match' multiple rows from range2 and that we want to print all matches (we can modify the proposed code if we need to stop processing a range1 row once we find the first 'match')

Sample data:

$ cat range1
15      20
 8      10
37      44
32      37

$ cat range2
chr1    6       12
chr2    13      21
chr3    31      35
chr4    36      45
chr15   36      67             # added to demonstrate multi-match for range1 [ 37 , 44 ]

Issues with current code:

loads the range1 data into an array and then loops over this (ever-growing) array for each line read from range1; this array is unnecessary as we only need to process the current row from range1
the dual loop logic is aborted (; next) upon printing the first matching set of records; this premature exit means we only see the first match ... over and over; the ; next can be removed
the range2[a] column is not captured during range2 input processing so we're unable to display this column in the final output

Updating OP's current code to address these issues:

awk '
BEGIN   { FS=OFS="	" }

FNR==NR { chromo[++count]=$1
          lbs[count]=$2
          ubs[count]=$3
          next
        }

        { lb=$1
          ub=$2

          for (i=1;i<=count;++i)
              if ( lb >= lbs[i] && lb <= ubs[i] && ub >= lbs[i] && ub <= ubs[i] )
                 print chromo[i],lbs[i],ubs[i],lb,ub
        }
' range2 range1

This generates:

chr2    13      21      15      20
chr1    6       12       8      10
chr4    36      45      37      44
chr15   36      67      37      44
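
If the header lines do exist and we only want the first match per range1 row (the two "we can modify the proposed code" cases from the assumptions), a minimal sketch could look like the following. It assumes tab-delimited files whose first line is a header; the temporary files and the first_match variable are hypothetical names used only for this illustration.

```shell
#!/bin/sh
# Hypothetical sample files, tab-delimited, each starting with a header line.
tmp=$(mktemp -d)
printf 'a\tb\n15\t20\n8\t10\n37\t44\n32\t37\n' > "$tmp/range1"
printf 'a\tb\tc\nchr1\t6\t12\nchr2\t13\t21\nchr3\t31\t35\nchr4\t36\t45\nchr15\t36\t67\n' > "$tmp/range2"

first_match=$(awk '
BEGIN   { FS=OFS="\t" }

# assumption: line 1 of each file is a header, so skip it
FNR==1  { next }

# 1st file (range2): store the label plus lower/upper bounds
FNR==NR { chromo[++count]=$1; lbs[count]=$2+0; ubs[count]=$3+0; next }

# 2nd file (range1): report only the first range2 row containing this range
        { lb=$1+0; ub=$2+0
          for (i=1;i<=count;++i)
              if ( lb >= lbs[i] && lb <= ubs[i] && ub >= lbs[i] && ub <= ubs[i] ) {
                 print chromo[i],lbs[i],ubs[i],lb,ub
                 break          # stop at the first match for this range1 row
              }
        }
' "$tmp/range2" "$tmp/range1")

printf '%s\n' "$first_match"
rm -rf "$tmp"
```

With this data the range [ 37 , 44 ] is reported once, against chr4, because the loop breaks before reaching chr15.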

If the output needs to be sorted we could modify the awk code to store the results in another array and then sort and print the array during END {...} processing. But for simplicity's sake we'll just pipe the output to sort, e.g.:

$ awk ' BEGIN { FS=OFS="	" } FNR==NR ....' range2 range1 | sort -V
chr1    6       12       8      10
chr2    13      21      15      20
chr4    36      45      37      44
chr15   36      67      37      44
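
For the store-then-print-in-END idea, one possible sketch is shown below. Note that it does not perform a true sort: it buckets the matches by range2 row and prints the buckets in range2 input order, which only yields the same result as sort -V when range2 is already in the desired order (as in the sample). The hits array, the ordered variable and the temporary files are hypothetical names for this illustration, and tab-delimited input is assumed.

```shell
#!/bin/sh
# Sample data from above, written as tab-delimited files (assumed delimiter).
tmp=$(mktemp -d)
printf '15\t20\n8\t10\n37\t44\n32\t37\n' > "$tmp/range1"
printf 'chr1\t6\t12\nchr2\t13\t21\nchr3\t31\t35\nchr4\t36\t45\nchr15\t36\t67\n' > "$tmp/range2"

ordered=$(awk '
BEGIN   { FS=OFS="\t" }

# 1st file (range2): store the label plus lower/upper bounds
FNR==NR { chromo[++count]=$1; lbs[count]=$2+0; ubs[count]=$3+0; next }

# 2nd file (range1): append each match to the bucket of its range2 row
        { lb=$1+0; ub=$2+0
          for (i=1;i<=count;++i)
              if ( lb >= lbs[i] && lb <= ubs[i] && ub >= lbs[i] && ub <= ubs[i] )
                 hits[i] = hits[i] chromo[i] OFS lbs[i] OFS ubs[i] OFS lb OFS ub ORS
        }

# print the buckets in range2 input order (empty buckets print nothing)
END     { for (i=1;i<=count;++i) printf "%s", hits[i] }
' "$tmp/range2" "$tmp/range1")

printf '%s\n' "$ordered"
rm -rf "$tmp"
```

For the sample data this prints the rows in the order chr1, chr2, chr4, chr15. If range2 is not already ordered, piping to sort as shown above remains the simpler option.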
