Reputation: 357
I want to flag rows in my df whose ranges overlap (that is, create the Overlap column shown below). The ranges are defined by two numeric variables (Min, Max), which I could convert to integer if necessary:
Class Min Max
A 100 200
A 120 205
A 210 310
A 500 630
A 510 530
A 705 800
Transform into:
Class Min Max Overlap
A 100 200 1
A 120 205 1
A 210 310 0
A 500 630 1
A 510 530 1
A 705 800 0
I have tried IRanges without much success - any ideas?
Upvotes: 5
Views: 770
Reputation: 2950
You can use the ivs package for this, which is a specialized package for working with intervals. Use iv_count_overlaps() to count the number of self-overlaps, then flag any interval with a count greater than 1 (every interval has at least 1 overlap because it matches itself).
library(ivs)
library(dplyr)
df <- tibble(Class = c("A", "A", "A", "A", "A", "A"),
Min = c(100, 120, 210, 500, 510, 705),
Max = c(200, 205, 310, 630, 530, 800))
df <- df %>%
mutate(Range = iv(Min, Max), .keep = "unused")
df %>%
mutate(Overlap = iv_count_overlaps(Range, Range) > 1L)
#> # A tibble: 6 × 3
#> Class Range Overlap
#> <chr> <iv<dbl>> <lgl>
#> 1 A [100, 200) TRUE
#> 2 A [120, 205) TRUE
#> 3 A [210, 310) FALSE
#> 4 A [500, 630) TRUE
#> 5 A [510, 530) TRUE
#> 6 A [705, 800) FALSE
Upvotes: 0
Reputation: 4671
I find data.table very effective for finding overlaps, using foverlaps().
library(data.table)
Recreating the data:
dt <- data.table(Class = c("A", "A", "A", "A", "A", "A"),
Min = c(100, 120, 210, 500, 510, 705),
Max = c(200, 205, 310, 630, 530, 800))
Key the data.table; foverlaps requires this:
setkey(dt, Min, Max)
Here we run foverlaps on the table against itself, then filter out the rows where an interval overlaps with itself. The remaining rows are then counted, grouped by Min and Max.
# Drop self-matches: keep only pairs that differ in at least one endpoint
dt_overlaps <- foverlaps(dt, dt, type = "any")[Min != i.Min | Max != i.Max, .(Class, Overlap = .N), by = c("Min", "Max")]
Then flag the overlapping rows with an update join (thanks to DavidArenburg):
dt[dt_overlaps, Overlap := 1]
Results:
> dt
Class Min Max Overlap
1 A 100 200 1
2 A 120 205 1
3 A 210 310 NA
4 A 500 630 1
5 A 510 530 1
6 A 705 800 NA
There is probably neater data.table code for this, but I'm learning as well.
Upvotes: 3
Reputation: 4554
library(dplyr)
# Assumes rows are sorted by Min; only adjacent rows are compared
df_foo %>%
  mutate(flag = coalesce(ifelse(Max > lead(Min), 1, NA), ifelse(lag(Max) > Min, 1, NA)))
Class Min Max flag
1 A 100 200 1
2 A 120 205 1
3 A 210 310 NA
4 A 500 630 1
5 A 510 530 1
6 A 705 800 NA
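Note that lead() and lag() only look at the immediate neighbours of each row, so this relies on the rows being sorted by Min and can miss an interval that overlaps a longer, non-adjacent one. Below is a minimal base R sketch that covers that case by comparing each Min with the running maximum of the earlier Max values. flag_overlaps is a hypothetical helper name, not a function from any package, and the sketch assumes Max > Min on every row and treats touching endpoints as non-overlapping.

```r
# Hypothetical helper (not from any package): returns 1 for intervals that
# overlap at least one other interval, 0 otherwise. Assumes maxs > mins.
flag_overlaps <- function(mins, maxs) {
  ord <- order(mins)                   # process intervals sorted by start
  lo <- mins[ord]
  hi <- maxs[ord]
  n <- length(lo)
  next_lo <- c(lo[-1], Inf)            # start of the following interval
  prev_hi <- c(-Inf, cummax(hi)[-n])   # largest end among all earlier intervals
  flag <- as.integer(hi > next_lo | prev_hi > lo)
  flag[order(ord)]                     # restore the original row order
}

df_foo <- data.frame(
  Class = rep("A", 6),
  Min = c(100, 120, 210, 500, 510, 705),
  Max = c(200, 205, 310, 630, 530, 800),
  stringsAsFactors = FALSE
)
df_foo$flag <- flag_overlaps(df_foo$Min, df_foo$Max)
df_foo$flag
# 1 1 0 1 1 0
```

With three intervals such as 100-400, 120-130 and 300-350, the third one overlaps the first but not its neighbour, and flag_overlaps(c(100, 120, 300), c(400, 130, 350)) flags all three rows.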
Upvotes: 0
Reputation: 10954
outer is my function of choice for fast pairwise comparisons. You can compare the interval endpoints pairwise using outer and then combine the comparisons in any way you want. In this case I check whether the two conditions required for an overlap hold simultaneously.
library(dplyr)
df_foo = read.table(
textConnection("Class Min Max
A 100 200
A 120 205
A 210 310
A 500 630
A 510 530
A 705 800"), header = TRUE
)
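# Pairwise comparisons: entry [i, j] compares interval i with interval j
ends_after_start = outer(df_foo$Max, df_foo$Min, ">")
starts_before_end = outer(df_foo$Min, df_foo$Max, "<")
# Each interval overlaps itself, so more than one hit means a real overlap
df_foo %>%
  mutate(Overlap = rowSums(ends_after_start & starts_before_end) > 1)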
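The sample data has a single Class, so the comparison above ignores it. If overlaps should only count within the same Class, one possible extension is a third outer() comparison on Class, with the result converted to the 0/1 coding from the question. This is a base R sketch on made-up data (the class "B" rows are invented for illustration), not part of the original answer:

```r
# Illustrative data: the class "B" rows are invented for this example
df_foo <- data.frame(
  Class = c("A", "A", "A", "B", "B"),
  Min = c(100, 120, 210, 100, 300),
  Max = c(200, 205, 310, 200, 400),
  stringsAsFactors = FALSE
)

# Entry [i, j] is TRUE when rows i and j belong to the same class
same_class <- outer(df_foo$Class, df_foo$Class, "==")
# The two conditions for interval i to overlap interval j
ends_after_start <- outer(df_foo$Max, df_foo$Min, ">")
starts_before_end <- outer(df_foo$Min, df_foo$Max, "<")

hits <- same_class & ends_after_start & starts_before_end
# Every row matches itself, so a count above 1 means a real overlap
df_foo$Overlap <- as.integer(rowSums(hits) > 1)
df_foo$Overlap
# 1 1 0 0 0
```

Like the original, this builds n x n matrices, so it suits small to moderate data frames.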
Upvotes: 2