Reputation: 37464
Earlier today I saw a - apparently badly formed and therefore already deleted - question about removing overlapping intervals (or ranges, intervals from this on). The question was how to remove intervals which are completely within other intervals. For example, we have the following:
1-2
2-3
1-3
2-4
or if slightly better visualized:
1-2
2-3
1---3
2---4
Intervals 1-2
and 2-3
are both removed since they are included in the interval 1-3
so the output would be:
1-3
2-4
A priori algorithm would probably be to check every interval against every other resulting in O(n2) comparisons. Someone suggested to sort the source data before processing, are there other angles to this problem?
Obvious cases are (data sorted):
1-3 remove
1--4
1-3 remove this or next
1-3
1--4
2-4 remove
1---5
2-4 remove
1-3 print this, maybe next depending on the one after that
2-4
Please, if you come up with nice pitfalls or other cases in the data or affiliated tags add them.
Upvotes: 1
Views: 841
Reputation: 37464
This solution expects the data to be sorted before processing, as suggested by someone:
$ sort -t- -k1n -k2n file # playin' it safe
1-2
1-3
2-3
2-4
In awk:
$ cat program.awk
BEGIN { OFS=FS="-" }
{
if(p=="") { # if p is empty, fill it
p=$0 # p is the previous record
next
}
split(p,b,"-") # p is split to start and end to b[]
if(b[1] == $1 && b[2] <= $2) { # since sorting is expected:
p=$0 # if starts are equal p line is included or identical
next # so remove it
}
else if($2 <= b[2]) # latter is included
next
print p # no complete overlap, print p
p=$0 # and to the next
}
END { print p }
Run it:
$ awk -f program.awk <(sort -t- -k1n -k2n file)
1-3
2-4
or
1-2
2-3
Upvotes: 2
Reputation: 313
As long as the algorithm has polynomial complexity, I think the straightforward solution is ok as well:
#!/usr/bin/gawk -f
BEGIN {
FS=OFS="-";
}
{
arr[NR][1] = $1;
arr[NR][2] = $2;
}
END {
for(i in arr) {
delete_nxt_elem(i);
if(arr[i][1]!="")
print arr[i][1],arr[i][2];
}
}
function delete_nxt_elem(check_indx, j) {
for(j in arr) {
if(j==check_indx)
continue;
if(arr[j][1]<=arr[check_indx][1] && arr[j][2]>=arr[check_indx][2])
delete arr[check_indx];
}
}
Upvotes: 1