Reputation: 57
I have a data.frame of intervals given row-wise, the interval starts in column one, the interval ends in column 2. The numbers are not integers. How can I find the overlapping range, if any, of all intervals. e.g:
df <- cbind(c(1.5, 3, 2.1, 1), c(6, 5, 3.7, 10.1))
plot(1:11, ylim = c(0, 5), col = NA)
segments(x0 = c(1.5, 3, 2.1, 1), y0 = 1:4, x1 = c(6, 5, 3.7, 10.1), y1 = 1:4)
abline(v = 3, col = "red", lty = 2)
abline(v = 3.7, col = "red", lty = 2)
somefunc(df)
[1] 3 3.7
A nice, fast base R (or common package like dplyr ect) solution is preferred. I already know of foverlaps
(data.table) and IRranges, but they do not seem to address my problem. For bonus points, if there were interval(s) that prevented total overlap, e.g: rbind'ing c(20, 25)
to df
above, then the function should still return the largest possible overlap of any of the vectors, i.e. still returning c(3, 3.7)
.
EDIT: the solution linked by Henrik is good, but relies on generating a sequence with a given step (e.g. seq(start, end by = 1)
) then reducing them to get the intersection. My intervals may narrower than this step. Ideally I need a solution that operates using logical comparison or something like that. The second solution in the linked page is also not quite right (see below)
EDIT EDIT: The intersection should be returned only if it is common to all ranges. Solution 2 in the post linked by Henrik groups together intervals even if not all intervals in the group intersect with every other interval
Upvotes: 1
Views: 772
Reputation: 42544
Here is a solution which which seems to return the expected result for the given sample datasets.
It takes the vector of all unique interval endpoints and counts the number of intervals they are intersecting (by aggregating in a non-equi join). Among the subset of points with the maximum number of intersections, the range is taken.
library(data.table)
# enhanced dataset with 2 additional intervals
dt <- fread("lb, ub
1.5, 6
3 , 5
2.1, 3.7
1 , 10.1
8.3 , 12
20 , 25")
mdt <- dt[, .(b = unique(unlist(.SD)))]
res <- dt[mdt, on = .(lb <= b, ub >= b), .N, by = .EACHI][N == max(N), range(lb)]
res
[1] 3.0 3.7
library(ggolot2)
ggplot(dt) +
aes(x = lb, y = seq_along(lb), xend = ub, yend = seq_along(ub)) +
geom_segment() +
geom_vline(xintercept = res, col = "red", lty = 2)
The OP has pointed out that the case where there are no overlaps needs to be recognized and handled separately. So I have modified the code:
mdt <- dt[, .(b = unique(unlist(.SD)))]
res <- dt[mdt, on = .(lb <= b, ub >= b), .N, by = .EACHI][
N == max(N), {
if (max(N) > 1) {
cat("Maximum overlaps found:", max(N), "out of", nrow(dt), "intervals\n")
range(lb)
} else {
cat("No overlaps found\n")
NULL
}
}]
This code will recognize the situation where there are no overlaps and will return NULL
in these cases. In addition, a message is printed.
In all other cases, it will print an informative message, e.g.,
Maximum overlaps found: 4 out of 6 intervals
For OP's sample dataset without overlaps
dt <- data.table(lb = c(3, 6, 10), ub = c(5, 9, 15))
it will print
No overlaps found
In case of multiple solutions the code above will return the overall range, i.e, the start of the first interval and the end of the last interval instead of a list of separate intervals.
Sample data for this use case:
dt <- fread("lb, ub
1.5, 6
3 , 5
2.1, 3.7
1 , 10.1
11.5, 16
13 , 15
12.1, 13.7
11 , 20.1
0 , 22
")
So, there is a 5-fold overlap between 3 and 3.7 and a second 5-fold overlap between 13 and 13.7.
Furthermore, there is another use case which needs to be considered: How shall intervals be treated which overlap only in one point, i.e. one interval ends where another starts?
Upvotes: 3