Reputation: 14192
Say I have this dataframe, which has two IDs (1/2) with their start and end times in three different zones (A/B/C):
df <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), zone = c("A",
"B", "A", "C", "B", "A", "B", "A", "B", "C"), start = c(0, 6,
7, 8, 10, 0, 3, 5, 6, 7), end = c(6, 7, 8, 10, 11, 3, 5, 6, 7,
11)), row.names = c(NA, -10L), class = "data.frame")
df
id zone start end
1 1 A 0 6
2 1 B 6 7
3 1 A 7 8
4 1 C 8 10
5 1 B 10 11
6 2 A 0 3
7 2 B 3 5
8 2 A 5 6
9 2 B 6 7
10 2 C 7 11
If we look at each zone, we can visually inspect the times when IDs are in the same zone and when they are not:
split(df,df$zone)
$A
id zone start end
1 1 A 0 6
3 1 A 7 8
6 2 A 0 3
8 2 A 5 6
$B
id zone start end
2 1 B 6 7
5 1 B 10 11
7 2 B 3 5
9 2 B 6 7
$C
id zone start end
4 1 C 8 10
10 2 C 7 11
e.g. Both 1 and 2 are together in zone A from 0-3, and from 5-6, but not at other times.
Desired Output
I want to extract three dataframes.
zone start end id
1 A 0 3 1-2
2 A 5 6 1-2
3 B 6 7 1-2
4 C 8 10 1-2
2 & 3: Dataframes for times when they are not together:
#id=1
zone start end
1 A 3 5
2 A 7 8
3 B 10 11
#id=2
zone start end
1 B 3 5
2 C 7 8
3 C 10 11
I have been trying to work with foverlaps
from data.table
and the intervals
package, but can't seem to work out the correct method.
e.g. Subsetting each zone/id, I can sort of get an output that includes overlaps, but it doesn't seem to be quite the right direction:
A <- split(df,df$zone)$A
Asp <- split(A,A$id)
x <- setDT(Asp[[1]])
y <- setDT(Asp[[2]])
setkey(y, start, end)
foverlaps(x, y, type="any")
id zone start end i.id i.zone i.start i.end
1: 2 A 0 3 1 A 0 6
2: 2 A 5 6 1 A 0 6
3: NA <NA> NA NA 1 A 7 8
Any help greatly appreciated.
EDIT: Extra example dataset that seemed to bring up some issues with current suggested solutions:
df2 <- structure(list(start = c(0, 5, 6, 8, 10, 13, 15, 20, 22, 26,
29, 37, 40, 42, 0, 3, 6, 9, 15, 20, 25, 33, 35, 40), id = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2), zone = c("A", "B", "A", "D", "C", "B", "C", "B", "A",
"B", "A", "D", "C", "D", "A", "B", "C", "D", "A", "B", "C", "B",
"A", "D"), end = c(5, 6, 8, 10, 13, 15, 20, 22, 26, 29, 37, 40,
42, 45, 3, 6, 9, 15, 20, 25, 33, 35, 40, 45)), class = c("data.table", "data.frame"), row.names = c(NA, -24L))
df2
start id zone end
1: 0 1 A 5
2: 5 1 B 6
3: 6 1 A 8
4: 8 1 D 10
5: 10 1 C 13
6: 13 1 B 15
7: 15 1 C 20
8: 20 1 B 22
9: 22 1 A 26
10: 26 1 B 29
11: 29 1 A 37
12: 37 1 D 40
13: 40 1 C 42
14: 42 1 D 45
15: 0 2 A 3
16: 3 2 B 6
17: 6 2 C 9
18: 9 2 D 15
19: 15 2 A 20
20: 20 2 B 25
21: 25 2 C 33
22: 33 2 B 35
23: 35 2 A 40
24: 40 2 D 45
start id zone end
Upvotes: 5
Views: 286
Reputation: 21908
Updated Solution
I have made some modifications to the previous solution so that it works with the newly presented data set df2
:
id == 1
and id == 2
in every zone to try and find their intersectsid
s to extract their start
to end
values so that we have two vectors and we can find their intersects easily
` In the end I applied this function to every subset of our data setlibrary(dplyr)
library(tidyr)
library(purrr)
fn <- function(data, x, y) {
base::intersect(data %>%
filter(row_number() == x) %>%
select(start, end) %>%
{map2(.$start, .$end, ~ .x:.y)} %>%
unlist(),
data %>%
filter(row_number() == y) %>%
select(start, end) %>%
{map2(.$start, .$end, ~ .x:.y)} %>%
unlist())
}
Then we apply it on our data set:
split(df2, df2$zone) %>%
map(~ .x %>%
mutate(grp = row_number()) %>%
{expand.grid(.$grp[.$id == 1], .$grp[.$id == 2])} %>%
rowwise() %>%
mutate(insec = list(fn(.x, Var1, Var2))) %>%
filter(length(insec) != 0) %>%
unnest(cols = c(insec)) %>%
group_by(Var1, Var2) %>%
filter(row_number() == 1 | row_number() == n()) %>%
filter(n() > 1) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = insec) %>%
ungroup()) %>%
keep(~ nrow(.x) != 0) %>%
imap_dfr(~ .x %>%
mutate(zone
= .y) %>%
select(!starts_with("Var"))) %>%
relocate(zone) %>%
rename(start = `1`, end = `2`)
# A tibble: 6 x 3
zone start end
<chr> <int> <int>
1 A 0 3
2 A 35 37
3 B 5 6
4 B 20 22
5 D 9 10
6 D 42 45
Upvotes: 3
Reputation: 25225
For the first data.frame, you can also use a non-equi join:
ovlap <- df[df, on=.(zone, id<id, start<end, end>start), nomatch=0L,
.(zone, id2=i.id, i.start, i.end, id1=x.id, x.start, x.end)][,
.(start=max(x.start, i.start), end=min(x.end, i.end)),
.(zone, id1, id2, i.start)][,
i.start := NULL][]
# zone id1 id2 start end
#1: A 1 2 0 3
#2: A 1 2 5 6
#3: B 1 2 6 7
#4: C 1 2 8 10
For the other output data.frames, you can perform a non-equi join first with the previous result and then for each interval find the sub-intervals where the other partner is not around:
rangeDiff <- function(DT) {
DT[,
if (is.na(x.start[1L])) {
.(start=i.start, end=i.end)
} else {
.(start=c(i.start, x.end+1L),
end=c(x.start-1L, i.end))
},
.(zone, id, i.start, i.end)][
start<=end][,
c("i.start","i.end") := NULL][]
} #rangeDiff
rangeDiff(ovlap[df[id==1L], on=.(zone, id1=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id V1 V2
#1: A 1 4 4
#2: A 1 7 8
#3: B 1 10 11
rangeDiff(ovlap[df[id==2L], on=.(zone, id2=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id V1 V2
#1: B 2 3 5
#2: C 2 7 7
#3: C 2 11 11
There is some inconsistency in OP where the bounds of intervals are inclusive or exclusive. I have used inclusive when both ids are around in the same zone (i.e. in the first output data.frame).
Edit: show output for df2
ovlap
# zone id1 id2 start end
#1: A 1 2 0 3
#2: A 1 2 35 37
#3: B 1 2 5 6
#4: B 1 2 20 22
#5: D 1 2 9 10
#6: D 1 2 42 45
other required data.frames:
rangeDiff(ovlap[df[id==1L], on=.(zone, id1=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id start end
# 1: A 1 4 5
# 2: A 1 6 8
# 3: A 1 22 26
# 4: A 1 29 34
# 5: B 1 13 15
# 6: B 1 26 29
# 7: C 1 10 13
# 8: C 1 15 20
# 9: C 1 40 42
# 10: D 1 8 8
# 11: D 1 37 40
rangeDiff(ovlap[df[id==2L], on=.(zone, id2=id, start<end, end>start),
.(zone, id, i.start, i.end, x.start, x.end)])
# zone id start end
# 1: A 2 15 20
# 2: A 2 38 40
# 3: B 2 3 4
# 4: B 2 23 25
# 5: B 2 33 35
# 6: C 2 6 9
# 7: C 2 25 33
# 8: D 2 11 15
# 9: D 2 40 41
df2 sorted by zone for easier checking:
start id zone end
1: 0 1 A 5
2: 6 1 A 8
3: 22 1 A 26
4: 29 1 A 37
5: 0 2 A 3
6: 15 2 A 20
7: 35 2 A 40
8: 5 1 B 6
9: 13 1 B 15
10: 20 1 B 22
11: 26 1 B 29
12: 3 2 B 6
13: 20 2 B 25
14: 33 2 B 35
15: 10 1 C 13
16: 15 1 C 20
17: 40 1 C 42
18: 6 2 C 9
19: 25 2 C 33
20: 8 1 D 10
21: 37 1 D 40
22: 42 1 D 45
23: 9 2 D 15
24: 40 2 D 45
Upvotes: 3
Reputation: 66819
This seems to work, filtering the foverlaps output:
DT = data.table(df)
setkey(DT, start, end)
oDT0 = foverlaps(DT[id==1], DT[id==2])
oDT0[, `:=`(
ostart = pmax(start, i.start),
oend = pmin(end, i.end)
)]
oDT = oDT0[ostart < oend]
# together
oDT[zone == i.zone, .(ids = '1-2', zone, ostart, oend)]
# ids zone ostart oend
# 1: 1-2 A 0 3
# 2: 1-2 A 5 6
# 3: 1-2 B 6 7
# 4: 1-2 C 8 10
# apart
oDT[zone != i.zone, .(id, zone, i.id, i.zone, ostart, oend)]
# id zone i.id i.zone ostart oend
# 1: 2 B 1 A 3 5
# 2: 2 C 1 A 7 8
# 3: 2 C 1 B 10 11
Repeating for new input... not sure if it's correct since no expected output was provided:
> DT = data.table(df2)
> ...
> oDT[zone == i.zone, .(ids = '1-2', zone, ostart, oend)]
ids zone ostart oend
1: 1-2 A 0 3
2: 1-2 B 5 6
3: 1-2 D 9 10
4: 1-2 B 20 22
5: 1-2 A 35 37
6: 1-2 D 42 45
> oDT[zone != i.zone, .(id, zone, i.id, i.zone, ostart, oend)]
id zone i.id i.zone ostart oend
1: 2 B 1 A 3 5
2: 2 C 1 A 6 8
3: 2 C 1 D 8 9
4: 2 D 1 C 10 13
5: 2 D 1 B 13 15
6: 2 A 1 C 15 20
7: 2 B 1 A 22 25
8: 2 C 1 A 25 26
9: 2 C 1 B 26 29
10: 2 C 1 A 29 33
11: 2 B 1 A 33 35
12: 2 A 1 D 37 40
13: 2 D 1 C 40 42
I suspect there is a way to pass arguments to foverlaps
to avoid needing to define and filter by ostart
and oend
. As of the latest CRAN version of the package, the doc indicates that minoverlap
is not yet implemented, so maybe it is necessary for now.
Upvotes: 4
Reputation: 101034
I think you are almost there. You can try the code below by defining a function f
f <- function(A) {
Asp <- split(A, by = "id")
u <- na.omit(foverlaps(Asp[[1]], setkey(Asp[[2]], start, end)))
r <- c()
for (k in 1:nrow(u)) {
if (u[k, end - start < i.end - i.start]) {
p <- u[k, .(start, end)]
} else {
p <- u[k, .(start = i.start, end = i.end)]
}
r[[k]] <- p
}
cbind(
zone = u[, zone],
rbindlist(r),
id = paste0(unique(A[, id]), collapse = "-")
)
}
and then run
rbindlist(Map(f, split(setDT(df), by = "zone")))
which gives
> rbindlist(Map(f, split(setDT(df), by = "zone")))
zone start end id
1: A 0 3 1-2
2: A 5 6 1-2
3: B 6 7 1-2
4: C 8 10 1-2
Upvotes: 3