wetcoaster
wetcoaster

Reputation: 367

Sum Overlapping/Non-Overlapping Time Intervals in R

This is a follow: Find matching intervals in data frame by range of two column values

I have a data frame of time related events.

Using the same sample data as before:

Name     Event Order     Sequence     start_event     end_event     duration     Group 
JOHN     1               A               0               19          19           ID1
JOHN     2               A               60              112         52           ID1  
JOHN     3               A               392             429         37           ID1  
JOHN     4               B               282             329         47           ID1
JOHN     5               C               147             226         79           ID1  
JOHN     6               C               566             611         45           ID1  
ADAM     1               A               0               79          56           ID2
ADAM     2               A               384             407         23           ID2  
ADAM     3               B               0               79          79           ID2  
ADAM     4               B               505             586         81           ID2
ADAM     5               C               140             205         65           ID2  
ADAM     6               C               522             599         77           ID2  

I have overlapping time period for all the different groupings, but I am now looking to find an accurate total of shared time between all the different names (there will be 20+ in the final df) - still dependent on the sequence they are grouped in to.

For example, using John and Adam's start time of '0' seconds in group A, I know that they overlapped between 0-79 seconds of commonality (the max end point between the two of them that would show up in the overlap function), but their total actual share time is only 19 seconds (from 0-19, when John deactivated).

Another instance would be in sequence C, John is active from 566-611 seconds, and Adam is active from 522-599 seconds, the total shared active time is 33 seconds (from John starting activity at 566 and Adam deactivating at 599).

My desired output would be this style:

"John + Adam": total shared active time

"John - Adam": total active time (John without Adam, excludes time where they are active together)

"Adam - John": total active time (Adam without John, excludes time where they are active together)

And continuing for all permutations of the 20+ names and combos in the data frame

Thanks!

Upvotes: 2

Views: 1428

Answers (1)

Jonathan von Schroeder
Jonathan von Schroeder

Reputation: 1703

one approach is the following:

lines <- "Name  Event Order Sequence    start_event end_event   duration    Group
JOHN    1   A   0   19  19  ID1
JOHN    2   A   60  112 52  ID1
JOHN    3   A   392 429 37  ID1
JOHN    4   B   282 329 47  ID1
JOHN    5   C   147 226 79  ID1
JOHN    6   C   566 611 45  ID1
ADAM    1   A   0   79  56  ID2
ADAM    2   A   384 407 23  ID2
ADAM    3   B   0   79  79  ID2
ADAM    4   B   505 586 81  ID2
ADAM    5   C   140 205 65  ID2
ADAM    6   C   522 599 77  ID2"

con <- textConnection(lines)
df <- read.delim(con)
close(con)

extract_interval_as_vector <- function(df) {
  as.vector(t(subset(df,select=c('start_event','end_event'))))
}

sum_length_of_overlaps <- function(v1,v2) {
  id <- rep(c(1,0),c(length(v1),length(v2)))
  m <- rbind(id,1-id,c(v1,v2))
  m <- m[,order(m[3,])]
  idx <- which(cumsum(m[1,]) %% 2 & cumsum(m[2,]) %% 2)
  if(length(idx)) sum(sapply(idx,function(i) m[3,i+1]-m[3,i]))
  else 0
}
sum_length <- function(v) {
  sum(v[seq(2,length(v),2)]-v[seq(1,length(v),2)])
}

all_names <- unique(df$Name)
combs <- combn(all_names,2)

l = list()

for(i in 1:ncol(combs)) {
  df.sub1 <- subset(df,Name == combs[1,i])
  df.sub2 <- subset(df,Name == combs[2,i])
  l1 <- sum_length(extract_interval_as_vector(df.sub1)) #sum(df.sub1$duration)
  l2 <- sum_length(extract_interval_as_vector(df.sub2)) #sum(df.sub2$duration)
  seqs <- unique(df$Sequence)
  overlap <- sum(sapply(seqs,function(s) {
    v1 <- extract_interval_as_vector(subset(df.sub1,Sequence == s))
    v2 <- extract_interval_as_vector(subset(df.sub2,Sequence == s))
    sum_length_of_overlaps(v1,v2)
  }))
  l[[paste(combs[,i],collapse=" + ")]] = overlap
  l[[paste(combs[,i],collapse=" - ")]] = l1 - overlap
  l[[paste(rev(combs[,i]),collapse=" - ")]] = l2 - overlap
}

Remarks:

  • l1 and l2 could be calculated directly from df (as shown in the comments), but the line ADAM 1 A 0 79 56 ID2 contains a strange duration)
  • sum_length_of_overlaps works by looking for points that lie in both intervals (which is the case if an odd number of start and end points from both interval lists has been seen in the sorted list). These are the first points of the regions of intersection. Note: sum_length_of_overlaps will not work correctly if one of the vectors contains overlapping intervals.

Example: (of how sum_length_of_overlaps works)

Consider the intervals for sequence A:

> subset(df,Sequence=="A")
  Name Event.Order Sequence start_event end_event duration Group
1 JOHN           1        A           0        19       19   ID1
2 JOHN           2        A          60       112       52   ID1
3 JOHN           3        A         392       429       37   ID1
7 ADAM           1        A           0        79       56   ID2
8 ADAM           2        A         384       407       23   ID2

Putting only start_event and end_event row-wise into separate vectors for JOHN and ADAM one obtains

> v.john <- extract_interval_as_vector(subset(df,Sequence == "A" & Name == "JOHN"))
> v.john
[1]   0  19  60 112 392 429
> v.adam <- extract_interval_as_vector(subset(df,Sequence == "A" & Name == "ADAM"))
> v.adam
[1]   0  79 384 407

If one joins these vectors and sorts the resulting vector it is necessary to keep track which point belonged to which interval sequence. Thus it is useful to put this joint vector together with indicator rows into a matrix:

> id <- rep(c(1,0),c(length(v.john),length(v.adam)))
> m <- rbind(id,1-id,c(v.john,v.adam))
> m
   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
id    1    1    1    1    1    1    0    0    0     0
      0    0    0    0    0    0    1    1    1     1
      0   19   60  112  392  429    0   79  384   407

After sorting one can still figure out the original group by looking at either the first or the second row:

> m <- m[,order(m[3,])]
> m
   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
id    1    0    1    1    0    1    0    1    0     1
      0    1    0    0    1    0    1    0    1     0
      0    0   19   60   79  112  384  392  407   429

Since there is an intersection if and only if the start point of an interval has been seen in each group, but the corresponding end points have not, it is sufficient to count the number of points seen from each group. If the number of points seen from each group is odd the point is the start of an intersection:

> m[1,] <- cumsum(m[1,]) %% 2
> m[2,] <- cumsum(m[2,]) %% 2
> m
   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
id    1    1    0    1    1    0    0    1    1     0
      0    1    1    1    0    0    1    1    0     0
      0    0   19   60   79  112  384  392  407   429

Thus one immediately sees that m[3,2], m[3,4] and m[3,8] are the starting points of the intersections. (cf. also the manual derivation below)

Output:

> l
$`JOHN + ADAM`
[1] 144

$`JOHN - ADAM`
[1] 135

$`ADAM - JOHN`
[1] 260

Manual derivation of JOHN + ADAM:

  1. Intersections in Sequence A:
    • [0,19], [0,79] => Length 19
    • [60,112], [0,79] => Length 19
    • [392,429], [384,407] => Length 15
  2. Intersections in Sequence B: None
  3. Intersections in Sequence C:
    • [147,226], [140,205] => Length 58
    • [566,611], [522,599] => Length 33

Total length of intersections = 19 + 19 + 15 + 58 + 33 = 144

Upvotes: 5

Related Questions