Reputation: 367
This is a follow: Find matching intervals in data frame by range of two column values
I have a data frame of time related events.
Using the same sample data as before:
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 2 A 60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A 0 79 56 ID2
ADAM 2 A 384 407 23 ID2
ADAM 3 B 0 79 79 ID2
ADAM 4 B 505 586 81 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
I have overlapping time period for all the different groupings, but I am now looking to find an accurate total of shared time between all the different names (there will be 20+ in the final df) - still dependent on the sequence they are grouped in to.
For example, using John and Adam's start time of '0' seconds in group A, I know that they overlapped between 0-79 seconds of commonality (the max end point between the two of them that would show up in the overlap function), but their total actual share time is only 19 seconds (from 0-19, when John deactivated).
Another instance would be in sequence C, John is active from 566-611 seconds, and Adam is active from 522-599 seconds, the total shared active time is 33 seconds (from John starting activity at 566 and Adam deactivating at 599).
My desired output would be this style:
"John + Adam": total shared active time
"John - Adam": total active time (John without Adam, excludes time where they are active together)
"Adam - John": total active time (Adam without John, excludes time where they are active together)
And continuing for all permutations of the 20+ names and combos in the data frame
Thanks!
Upvotes: 2
Views: 1428
Reputation: 1703
one approach is the following:
lines <- "Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 2 A 60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A 0 79 56 ID2
ADAM 2 A 384 407 23 ID2
ADAM 3 B 0 79 79 ID2
ADAM 4 B 505 586 81 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2"
con <- textConnection(lines)
df <- read.delim(con)
close(con)
extract_interval_as_vector <- function(df) {
as.vector(t(subset(df,select=c('start_event','end_event'))))
}
sum_length_of_overlaps <- function(v1,v2) {
id <- rep(c(1,0),c(length(v1),length(v2)))
m <- rbind(id,1-id,c(v1,v2))
m <- m[,order(m[3,])]
idx <- which(cumsum(m[1,]) %% 2 & cumsum(m[2,]) %% 2)
if(length(idx)) sum(sapply(idx,function(i) m[3,i+1]-m[3,i]))
else 0
}
sum_length <- function(v) {
sum(v[seq(2,length(v),2)]-v[seq(1,length(v),2)])
}
all_names <- unique(df$Name)
combs <- combn(all_names,2)
l = list()
for(i in 1:ncol(combs)) {
df.sub1 <- subset(df,Name == combs[1,i])
df.sub2 <- subset(df,Name == combs[2,i])
l1 <- sum_length(extract_interval_as_vector(df.sub1)) #sum(df.sub1$duration)
l2 <- sum_length(extract_interval_as_vector(df.sub2)) #sum(df.sub2$duration)
seqs <- unique(df$Sequence)
overlap <- sum(sapply(seqs,function(s) {
v1 <- extract_interval_as_vector(subset(df.sub1,Sequence == s))
v2 <- extract_interval_as_vector(subset(df.sub2,Sequence == s))
sum_length_of_overlaps(v1,v2)
}))
l[[paste(combs[,i],collapse=" + ")]] = overlap
l[[paste(combs[,i],collapse=" - ")]] = l1 - overlap
l[[paste(rev(combs[,i]),collapse=" - ")]] = l2 - overlap
}
Remarks:
l1
and l2
could be calculated directly from df
(as shown in the comments), but the line ADAM 1 A 0 79 56 ID2
contains a strange duration)sum_length_of_overlaps
works by looking for points that lie in both intervals (which is the case if an odd number of start and end points from both interval lists has been seen in the sorted list). These are the first points of the regions of intersection. Note: sum_length_of_overlaps
will not work correctly if one of the vectors contains overlapping intervals.Example: (of how sum_length_of_overlaps
works)
Consider the intervals for sequence A
:
> subset(df,Sequence=="A")
Name Event.Order Sequence start_event end_event duration Group
1 JOHN 1 A 0 19 19 ID1
2 JOHN 2 A 60 112 52 ID1
3 JOHN 3 A 392 429 37 ID1
7 ADAM 1 A 0 79 56 ID2
8 ADAM 2 A 384 407 23 ID2
Putting only start_event
and end_event
row-wise into separate vectors for JOHN
and ADAM
one obtains
> v.john <- extract_interval_as_vector(subset(df,Sequence == "A" & Name == "JOHN"))
> v.john
[1] 0 19 60 112 392 429
> v.adam <- extract_interval_as_vector(subset(df,Sequence == "A" & Name == "ADAM"))
> v.adam
[1] 0 79 384 407
If one joins these vectors and sorts the resulting vector it is necessary to keep track which point belonged to which interval sequence. Thus it is useful to put this joint vector together with indicator rows into a matrix:
> id <- rep(c(1,0),c(length(v.john),length(v.adam)))
> m <- rbind(id,1-id,c(v.john,v.adam))
> m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
id 1 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 1 1 1 1
0 19 60 112 392 429 0 79 384 407
After sorting one can still figure out the original group by looking at either the first or the second row:
> m <- m[,order(m[3,])]
> m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
id 1 0 1 1 0 1 0 1 0 1
0 1 0 0 1 0 1 0 1 0
0 0 19 60 79 112 384 392 407 429
Since there is an intersection if and only if the start point of an interval has been seen in each group, but the corresponding end points have not, it is sufficient to count the number of points seen from each group. If the number of points seen from each group is odd the point is the start of an intersection:
> m[1,] <- cumsum(m[1,]) %% 2
> m[2,] <- cumsum(m[2,]) %% 2
> m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
id 1 1 0 1 1 0 0 1 1 0
0 1 1 1 0 0 1 1 0 0
0 0 19 60 79 112 384 392 407 429
Thus one immediately sees that m[3,2]
, m[3,4]
and m[3,8]
are the starting points of the intersections. (cf. also the manual derivation below)
Output:
> l
$`JOHN + ADAM`
[1] 144
$`JOHN - ADAM`
[1] 135
$`ADAM - JOHN`
[1] 260
Manual derivation of JOHN + ADAM
:
A
:
B
: NoneC
:
Total length of intersections = 19 + 19 + 15 + 58 + 33 = 144
Upvotes: 5