Reputation: 9018
I have this data frame sorted by END TIME:
df = data.frame(ID= c(1,1,1,1,1,1,1), NumberInSequence= c(1,2,3,4,5,6,7),
StartTime = as.POSIXct(c("2016-01-15 18:02:11 GMT","2016-01-15 18:10:33 GMT","2016-01-15 18:25:08 GMT",
"2016-01-15 18:33:56 GMT","2016-01-15 18:21:03 GMT","2016-01-15 19:55:09 GMT","2016-01-15 19:57:03 GMT")) ,
EndTime = as.POSIXct(c("2016-01-15 18:02:17 GMT","2016-01-15 18:10:39 GMT","2016-01-15 18:25:14 GMT",
"2016-01-15 18:34:02 GMT","2016-01-15 19:53:17 GMT","2016-01-15 19:56:15 GMT","2016-01-15 19:58:17 GMT"))
)
Each row is a time interval with a start time and end time
df
ID NumberInSequence StartTime EndTime
1 1 1 2016-01-15 18:02:11 2016-01-15 18:02:17
2 1 2 2016-01-15 18:10:33 2016-01-15 18:10:39
3 1 3 2016-01-15 18:25:08 2016-01-15 18:25:14
4 1 4 2016-01-15 18:33:56 2016-01-15 18:34:02
5 1 5 2016-01-15 18:21:03 2016-01-15 19:53:17
6 1 6 2016-01-15 19:55:09 2016-01-15 19:56:15
7 1 7 2016-01-15 19:57:03 2016-01-15 19:58:17
Then I use dplyr to add a couple fields that calculate the next start time and the wait time which is the difference between the NextStartTime and EndTime. This creates the "WaitTime" column which works in most cases unless there are overlapping inverals.
df %>% group_by(ID) %>%
mutate(
NextStartTime = lead(StartTime)[ifelse(lead(NumberInSequence) == (NumberInSequence + 1), TRUE, NA)] ,
WaitTime = difftime(NextStartTime,EndTime, units = 's')
#max_s = max(StartTime) #,
# cum_max_s = as.POSIXct(cummin(as.numeric(StartTime)),origin="1970-01-01")
)
ID NumberInSequence StartTime EndTime NextStartTime WaitTime
1 1 1 2016-01-15 18:02:11 2016-01-15 18:02:17 2016-01-15 18:10:33 496 secs
2 1 2 2016-01-15 18:10:33 2016-01-15 18:10:39 2016-01-15 18:25:08 869 secs
3 1 3 2016-01-15 18:25:08 2016-01-15 18:25:14 2016-01-15 18:33:56 522 secs
4 1 4 2016-01-15 18:33:56 2016-01-15 18:34:02 2016-01-15 18:21:03 -779 secs
5 1 5 2016-01-15 18:21:03 2016-01-15 19:53:17 2016-01-15 19:55:09 112 secs
6 1 6 2016-01-15 19:55:09 2016-01-15 19:56:15 2016-01-15 19:57:03 48 secs
7 1 7 2016-01-15 19:57:03 2016-01-15 19:58:17 <NA> NA secs
Now I need to add a column called "FLAG" with value being OK or NOT OK where
"OK" means the interval IS NOT enitrely OR partially within another interval either. So intervals with "OK" have no overlap with other intervals.
"NOT OK" means the interval IS either partially OR entirely withing another interval. So intervals with "NOT OK" have overlap with other intervals.
I have the intervals below and what the result of the FLAG column should be with a short description
StartTime EndTime FLAG
2016-01-15 18:02:11 2016-01-15 18:02:17 OK - this interval does not overlap with other intervals
2016-01-15 18:10:33 2016-01-15 18:10:39 OK - this interval does not overlap with other intervals
2016-01-15 18:25:08 2016-01-15 18:25:14 NOT OK - this inerval is within the 18:21:03 start time interval
2016-01-15 18:33:56 2016-01-15 18:34:02 NOT OK - this inerval is within the 18:21:03 start time interval
2016-01-15 18:21:03 2016-01-15 19:53:17 NOT OK - this interval contains other intervals
2016-01-15 19:55:09 2016-01-15 19:56:15 OK - this interval does not overlap with other intervals
2016-01-15 19:57:03 2016-01-15 19:58:17 OK - this interval does not overlap with other intervals
I was looking at using cummin in or cummax in dplyr.....maybe....
cum_max_s = as.POSIXct(cummin(as.numeric(StartTime)),origin="1970-01-01")
Upvotes: 2
Views: 720
Reputation: 23574
Here is my attempt for you. I think foverlaps()
in the data.table package is our friend for this kind of case. You can find some examples on SO. You want to check them out in order to understand the function. You need to create a dummy data.table including starting and ending time. In your case, you have them. I created dummy
with minimum information. Then, you use setkey()
and utilize foverlaps()
.
# Create a dummy dt for hoverlaps.
dummy <- setDT(df2)[, 1:4, with = FALSE]
# Use foverlaps().
setkey(setDT(df2), StartTime, EndTime)
foo <- foverlaps(dummy, setDT(df2), by.x = c("StartTime", "EndTime"))
Now, it is time to clean up the data. For each NumberInSequence
, if there are more than 1 overlapping intervals (n > 1), remove a row, which has identical start and ending time (StartTime == i.StartTime & EndTime == i.EndTime
). Then, remove duplicated rows for each NumberInSequence
. If you have just one row indicating overlaping with another interval, that is enough, right? Finally, if StartTime == i.StartTime & EndTime == i.EndTime
is TRUE
, that means there is no other interval overlaping with the interval. So, you say OK
. Otherwise, NOT OK
. If necessary, remove extra columns later.
foo[,.SD[!(StartTime == i.StartTime & EndTime == i.EndTime & .N > 1)],
by = c("ID","NumberInSequence")][!duplicated(NumberInSequence)][,
check := ifelse(StartTime == i.StartTime & EndTime == i.EndTime,
"OK", "NOT OK")] -> out
print(out)
# ID NumberInSequence StartTime EndTime NextStartTime WaitTime i.ID i.NumberInSequence
#1: 1 1 2016-01-15 18:02:11 2016-01-15 18:02:17 2016-01-15 18:10:33 496 secs 1 1
#2: 1 2 2016-01-15 18:10:33 2016-01-15 18:10:39 2016-01-15 18:25:08 869 secs 1 2
#3: 1 5 2016-01-15 18:21:03 2016-01-15 19:53:17 2016-01-15 19:55:09 112 secs 1 3
#4: 1 3 2016-01-15 18:25:08 2016-01-15 18:25:14 2016-01-15 18:33:56 522 secs 1 5
#5: 1 4 2016-01-15 18:33:56 2016-01-15 18:34:02 2016-01-15 18:21:03 -779 secs 1 5
#6: 1 6 2016-01-15 19:55:09 2016-01-15 19:56:15 2016-01-15 19:57:03 48 secs 1 6
#7: 1 7 2016-01-15 19:57:03 2016-01-15 19:58:17 <NA> NA secs 1 7
# i.StartTime i.EndTime check
#1: 2016-01-15 18:02:11 2016-01-15 18:02:17 OK
#2: 2016-01-15 18:10:33 2016-01-15 18:10:39 OK
#3: 2016-01-15 18:25:08 2016-01-15 18:25:14 NOT OK
#4: 2016-01-15 18:21:03 2016-01-15 19:53:17 NOT OK
#5: 2016-01-15 18:21:03 2016-01-15 19:53:17 NOT OK
#6: 2016-01-15 19:55:09 2016-01-15 19:56:15 OK
#7: 2016-01-15 19:57:03 2016-01-15 19:58:17 OK
Upvotes: 2