Reputation: 14212

Find overlaps in time intervals by group and return subsetted data.frame

Say I have this dataframe, which has two IDs (1/2) with their start and end times in three different zones (A/B/C):

df <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), zone = c("A", 
"B", "A", "C", "B", "A", "B", "A", "B", "C"), start = c(0, 6, 
7, 8, 10, 0, 3, 5, 6, 7), end = c(6, 7, 8, 10, 11, 3, 5, 6, 7, 
11)), row.names = c(NA, -10L), class = "data.frame")

df

   id zone start end
1   1    A     0   6
2   1    B     6   7
3   1    A     7   8
4   1    C     8  10
5   1    B    10  11
6   2    A     0   3
7   2    B     3   5
8   2    A     5   6
9   2    B     6   7
10  2    C     7  11

If we look at each zone, we can visually inspect the times when IDs are in the same zone and when they are not:

split(df,df$zone)

$A
  id zone start end
1  1    A     0   6
3  1    A     7   8
6  2    A     0   3
8  2    A     5   6

$B
  id zone start end
2  1    B     6   7
5  1    B    10  11
7  2    B     3   5
9  2    B     6   7

$C
   id zone start end
4   1    C     8  10
10  2    C     7  11

e.g. Both 1 and 2 are together in zone A from 0-3, and from 5-6, but not at other times.

Desired Output

I want to extract three dataframes.

A dataframe showing the times and zones they are together:

  zone start end  id
1    A     0   3 1-2
2    A     5   6 1-2
3    B     6   7 1-2
4    C     8  10 1-2

2 & 3: Dataframes for times when they are not together:

#id=1
  zone start end
1    A     3   5
2    A     7   8
3    B    10  11

#id=2
  zone start end
1    B     3   5
2    C     7   8
3    C    10  11

I have been trying to work with foverlaps from data.table and the intervals package, but can't seem to work out the correct method.

e.g. Subsetting each zone/id, I can sort of get an output that includes overlaps, but it doesn't seem to be quite the right direction:

A <- split(df,df$zone)$A
Asp <- split(A,A$id)
x <- setDT(Asp[[1]])
y <- setDT(Asp[[2]])

setkey(y, start, end)

foverlaps(x, y, type="any")

   id zone start end i.id i.zone i.start i.end
1:  2    A     0   3    1      A       0     6
2:  2    A     5   6    1      A       0     6
3: NA <NA>    NA  NA    1      A       7     8

Any help greatly appreciated.

EDIT: Extra example dataset that seemed to bring up some issues with current suggested solutions:

df2 <- structure(list(start = c(0, 5, 6, 8, 10, 13, 15, 20, 22, 26, 
       29, 37, 40, 42, 0, 3, 6, 9, 15, 20, 25, 33, 35, 40), id = c(1, 
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 
       2, 2), zone = c("A", "B", "A", "D", "C", "B", "C", "B", "A", 
       "B", "A", "D", "C", "D", "A", "B", "C", "D", "A", "B", "C", "B", 
       "A", "D"), end = c(5, 6, 8, 10, 13, 15, 20, 22, 26, 29, 37, 40, 
       42, 45, 3, 6, 9, 15, 20, 25, 33, 35, 40, 45)), class = c("data.table", "data.frame"), row.names = c(NA, -24L))
          
df2

    start id zone end
 1:     0  1    A   5
 2:     5  1    B   6
 3:     6  1    A   8
 4:     8  1    D  10
 5:    10  1    C  13
 6:    13  1    B  15
 7:    15  1    C  20
 8:    20  1    B  22
 9:    22  1    A  26
10:    26  1    B  29
11:    29  1    A  37
12:    37  1    D  40
13:    40  1    C  42
14:    42  1    D  45
15:     0  2    A   3
16:     3  2    B   6
17:     6  2    C   9
18:     9  2    D  15
19:    15  2    A  20
20:    20  2    B  25
21:    25  2    C  33
22:    33  2    B  35
23:    35  2    A  40
24:    40  2    D  45
    start id zone end

Upvotes: 5

Answers (4)

Anoushiravan R

Reputation: 21938

Updated Solution I have made some modifications to the previous solution so that it works with the newly presented data set df2:

I tried to create all combinations of id == 1 and id == 2 in every zone to try and find their intersects
Then I created a custom function to take a subset of our data set plus a pair of ids to extract their start to end values so that we have two vectors and we can find their intersects easily ` In the end I applied this function to every subset of our data set

library(dplyr)
library(tidyr)
library(purrr)

fn <- function(data, x, y) {
  base::intersect(data %>%
                    filter(row_number() == x) %>%
                    select(start, end) %>%
                    {map2(.$start, .$end, ~ .x:.y)} %>%
                    unlist(), 
                  data %>%
                    filter(row_number() == y) %>%
                    select(start, end) %>%
                    {map2(.$start, .$end, ~ .x:.y)} %>%
                    unlist())
}

Then we apply it on our data set:

split(df2, df2$zone) %>%
  map(~ .x %>% 
        mutate(grp = row_number()) %>%
        {expand.grid(.$grp[.$id == 1], .$grp[.$id == 2])} %>%
        rowwise() %>%
        mutate(insec = list(fn(.x, Var1, Var2))) %>%
        filter(length(insec) != 0) %>%
        unnest(cols = c(insec)) %>%
        group_by(Var1, Var2) %>%
        filter(row_number() == 1 | row_number() == n()) %>%
        filter(n() > 1) %>%
        mutate(id = row_number()) %>%
        pivot_wider(names_from = id, values_from = insec) %>%
        ungroup()) %>%
  keep(~ nrow(.x) != 0) %>%
  imap_dfr(~ .x %>% 
             mutate(zone 
                    = .y) %>%
             select(!starts_with("Var"))) %>%
  relocate(zone) %>%
  rename(start = `1`, end = `2`)

# A tibble: 6 x 3
  zone  start   end
  <chr> <int> <int>
1 A         0     3
2 A        35    37
3 B         5     6
4 B        20    22
5 D         9    10
6 D        42    45

Upvotes: 3

chinsoon12

Reputation: 25225

For the first data.frame, you can also use a non-equi join:

ovlap <- df[df, on=.(zone, id<id, start<end, end>start), nomatch=0L,
    .(zone, id2=i.id, i.start, i.end, id1=x.id, x.start, x.end)][,
        .(start=max(x.start, i.start), end=min(x.end, i.end)), 
        .(zone, id1, id2, i.start)][,
            i.start := NULL][]
#   zone id1 id2 start end
#1:    A   1   2     0   3
#2:    A   1   2     5   6
#3:    B   1   2     6   7
#4:    C   1   2     8  10

For the other output data.frames, you can perform a non-equi join first with the previous result and then for each interval find the sub-intervals where the other partner is not around:

rangeDiff <- function(DT) {
    DT[, 
        if (is.na(x.start[1L])) {
            .(start=i.start, end=i.end)   
        } else {
            .(start=c(i.start, x.end+1L),
                end=c(x.start-1L, i.end))
        }, 
        .(zone, id, i.start, i.end)][
            start<=end][,
                c("i.start","i.end") := NULL][]
} #rangeDiff

rangeDiff(ovlap[df[id==1L], on=.(zone, id1=id, start<end, end>start),
    .(zone, id, i.start, i.end, x.start, x.end)])
#   zone id V1 V2
#1:    A  1  4  4
#2:    A  1  7  8
#3:    B  1 10 11


rangeDiff(ovlap[df[id==2L], on=.(zone, id2=id, start<end, end>start),
    .(zone, id, i.start, i.end, x.start, x.end)])
#   zone id V1 V2
#1:    B  2  3  5
#2:    C  2  7  7
#3:    C  2 11 11

There is some inconsistency in OP where the bounds of intervals are inclusive or exclusive. I have used inclusive when both ids are around in the same zone (i.e. in the first output data.frame).

Edit: show output for df2

ovlap

#   zone id1 id2 start end
#1:    A   1   2     0   3
#2:    A   1   2    35  37
#3:    B   1   2     5   6
#4:    B   1   2    20  22
#5:    D   1   2     9  10
#6:    D   1   2    42  45

other required data.frames:

rangeDiff(ovlap[df[id==1L], on=.(zone, id1=id, start<end, end>start),
    .(zone, id, i.start, i.end, x.start, x.end)])
#     zone id start end
#  1:    A  1     4   5
#  2:    A  1     6   8
#  3:    A  1    22  26
#  4:    A  1    29  34
#  5:    B  1    13  15
#  6:    B  1    26  29
#  7:    C  1    10  13
#  8:    C  1    15  20
#  9:    C  1    40  42
# 10:    D  1     8   8
# 11:    D  1    37  40

rangeDiff(ovlap[df[id==2L], on=.(zone, id2=id, start<end, end>start),
    .(zone, id, i.start, i.end, x.start, x.end)])
#    zone id start end
# 1:    A  2    15  20
# 2:    A  2    38  40
# 3:    B  2     3   4
# 4:    B  2    23  25
# 5:    B  2    33  35
# 6:    C  2     6   9
# 7:    C  2    25  33
# 8:    D  2    11  15
# 9:    D  2    40  41

df2 sorted by zone for easier checking:

    start id zone end
 1:     0  1    A   5
 2:     6  1    A   8
 3:    22  1    A  26
 4:    29  1    A  37
 5:     0  2    A   3
 6:    15  2    A  20
 7:    35  2    A  40
 8:     5  1    B   6
 9:    13  1    B  15
10:    20  1    B  22
11:    26  1    B  29
12:     3  2    B   6
13:    20  2    B  25
14:    33  2    B  35
15:    10  1    C  13
16:    15  1    C  20
17:    40  1    C  42
18:     6  2    C   9
19:    25  2    C  33
20:     8  1    D  10
21:    37  1    D  40
22:    42  1    D  45
23:     9  2    D  15
24:    40  2    D  45

Upvotes: 3

Frank

Reputation: 66819

This seems to work, filtering the foverlaps output:

DT = data.table(df)
setkey(DT, start, end)
oDT0 = foverlaps(DT[id==1], DT[id==2])
oDT0[, `:=`(
  ostart = pmax(start, i.start),
  oend = pmin(end, i.end)
)]
oDT = oDT0[ostart < oend]

# together
oDT[zone == i.zone, .(ids = '1-2', zone, ostart, oend)]
#    ids zone ostart oend
# 1: 1-2    A      0    3
# 2: 1-2    A      5    6
# 3: 1-2    B      6    7
# 4: 1-2    C      8   10

# apart
oDT[zone != i.zone, .(id, zone, i.id, i.zone, ostart, oend)]
#    id zone i.id i.zone ostart oend
# 1:  2    B    1      A      3    5
# 2:  2    C    1      A      7    8
# 3:  2    C    1      B     10   11

Repeating for new input... not sure if it's correct since no expected output was provided:

> DT = data.table(df2)
> ...
> oDT[zone == i.zone, .(ids = '1-2', zone, ostart, oend)]
   ids zone ostart oend
1: 1-2    A      0    3
2: 1-2    B      5    6
3: 1-2    D      9   10
4: 1-2    B     20   22
5: 1-2    A     35   37
6: 1-2    D     42   45
> oDT[zone != i.zone, .(id, zone, i.id, i.zone, ostart, oend)]
    id zone i.id i.zone ostart oend
 1:  2    B    1      A      3    5
 2:  2    C    1      A      6    8
 3:  2    C    1      D      8    9
 4:  2    D    1      C     10   13
 5:  2    D    1      B     13   15
 6:  2    A    1      C     15   20
 7:  2    B    1      A     22   25
 8:  2    C    1      A     25   26
 9:  2    C    1      B     26   29
10:  2    C    1      A     29   33
11:  2    B    1      A     33   35
12:  2    A    1      D     37   40
13:  2    D    1      C     40   42

I suspect there is a way to pass arguments to foverlaps to avoid needing to define and filter by ostart and oend. As of the latest CRAN version of the package, the doc indicates that minoverlap is not yet implemented, so maybe it is necessary for now.

Upvotes: 4

ThomasIsCoding

Reputation: 102605

I think you are almost there. You can try the code below by defining a function f

f <- function(A) {
    Asp <- split(A, by = "id")
    u <- na.omit(foverlaps(Asp[[1]], setkey(Asp[[2]], start, end)))
    r <- c()
    for (k in 1:nrow(u)) {
        if (u[k, end - start < i.end - i.start]) {
            p <- u[k, .(start, end)]
        } else {
            p <- u[k, .(start = i.start, end = i.end)]
        }
        r[[k]] <- p
    }
    cbind(
        zone = u[, zone],
        rbindlist(r),
        id = paste0(unique(A[, id]), collapse = "-")
    )
}

and then run

rbindlist(Map(f, split(setDT(df), by = "zone")))

which gives

> rbindlist(Map(f, split(setDT(df), by = "zone")))
   zone start end  id
1:    A     0   3 1-2
2:    A     5   6 1-2
3:    B     6   7 1-2
4:    C     8  10 1-2

Upvotes: 3

Find overlaps in time intervals by group and return subsetted data.frame

Answers (4)

Related Questions