user3022875
user3022875

Reputation: 9018

check for overlapping intervals start and end times

I have this data frame sorted by END TIME:

 df = data.frame(ID= c(1,1,1,1,1,1,1),   NumberInSequence= c(1,2,3,4,5,6,7), 
                 StartTime = as.POSIXct(c("2016-01-15 18:02:11 GMT","2016-01-15 18:10:33 GMT","2016-01-15 18:25:08 GMT",
                                               "2016-01-15 18:33:56 GMT","2016-01-15 18:21:03 GMT","2016-01-15 19:55:09 GMT","2016-01-15 19:57:03 GMT"))  ,
                        EndTime = as.POSIXct(c("2016-01-15 18:02:17 GMT","2016-01-15 18:10:39 GMT","2016-01-15 18:25:14 GMT",
                                               "2016-01-15 18:34:02 GMT","2016-01-15 19:53:17 GMT","2016-01-15 19:56:15 GMT","2016-01-15 19:58:17 GMT"))
                       )

Each row is a time interval with a start time and end time

df

 ID NumberInSequence           StartTime             EndTime
1  1                1 2016-01-15 18:02:11 2016-01-15 18:02:17
2  1                2 2016-01-15 18:10:33 2016-01-15 18:10:39
3  1                3 2016-01-15 18:25:08 2016-01-15 18:25:14
4  1                4 2016-01-15 18:33:56 2016-01-15 18:34:02
5  1                5 2016-01-15 18:21:03 2016-01-15 19:53:17
6  1                6 2016-01-15 19:55:09 2016-01-15 19:56:15
7  1                7 2016-01-15 19:57:03 2016-01-15 19:58:17

Then I use dplyr to add a couple fields that calculate the next start time and the wait time which is the difference between the NextStartTime and EndTime. This creates the "WaitTime" column which works in most cases unless there are overlapping inverals.

   df %>% group_by(ID) %>% 
      mutate(
      NextStartTime = lead(StartTime)[ifelse(lead(NumberInSequence) == (NumberInSequence + 1), TRUE, NA)] ,
      WaitTime = difftime(NextStartTime,EndTime, units = 's')
      #max_s = max(StartTime) #,
     # cum_max_s = as.POSIXct(cummin(as.numeric(StartTime)),origin="1970-01-01")
      )


  ID NumberInSequence           StartTime             EndTime       NextStartTime  WaitTime
1  1                1 2016-01-15 18:02:11 2016-01-15 18:02:17 2016-01-15 18:10:33  496 secs
2  1                2 2016-01-15 18:10:33 2016-01-15 18:10:39 2016-01-15 18:25:08  869 secs
3  1                3 2016-01-15 18:25:08 2016-01-15 18:25:14 2016-01-15 18:33:56  522 secs
4  1                4 2016-01-15 18:33:56 2016-01-15 18:34:02 2016-01-15 18:21:03 -779 secs
5  1                5 2016-01-15 18:21:03 2016-01-15 19:53:17 2016-01-15 19:55:09  112 secs
6  1                6 2016-01-15 19:55:09 2016-01-15 19:56:15 2016-01-15 19:57:03   48 secs
7  1                7 2016-01-15 19:57:03 2016-01-15 19:58:17                <NA>   NA secs

Now I need to add a column called "FLAG" with value being OK or NOT OK where

"OK" means the interval IS NOT enitrely OR partially within another interval either. So intervals with "OK" have no overlap with other intervals.

"NOT OK" means the interval IS either partially OR entirely withing another interval. So intervals with "NOT OK" have overlap with other intervals.

I have the intervals below and what the result of the FLAG column should be with a short description

 StartTime             EndTime              FLAG
2016-01-15 18:02:11 2016-01-15 18:02:17     OK - this interval does not overlap with other intervals
2016-01-15 18:10:33 2016-01-15 18:10:39     OK - this interval does not overlap with other intervals
2016-01-15 18:25:08 2016-01-15 18:25:14     NOT OK - this inerval is within the  18:21:03 start time interval 
2016-01-15 18:33:56 2016-01-15 18:34:02     NOT OK - this inerval is within the  18:21:03 start time interval 
2016-01-15 18:21:03 2016-01-15 19:53:17     NOT OK  - this interval contains other intervals 
2016-01-15 19:55:09 2016-01-15 19:56:15     OK - this interval does not overlap with other intervals
2016-01-15 19:57:03 2016-01-15 19:58:17     OK - this interval does not overlap with other intervals

I was looking at using cummin in or cummax in dplyr.....maybe....

cum_max_s = as.POSIXct(cummin(as.numeric(StartTime)),origin="1970-01-01")

Upvotes: 2

Views: 720

Answers (1)

jazzurro
jazzurro

Reputation: 23574

Here is my attempt for you. I think foverlaps() in the data.table package is our friend for this kind of case. You can find some examples on SO. You want to check them out in order to understand the function. You need to create a dummy data.table including starting and ending time. In your case, you have them. I created dummy with minimum information. Then, you use setkey() and utilize foverlaps().

# Create a dummy dt for hoverlaps.
dummy <- setDT(df2)[, 1:4, with = FALSE]

# Use foverlaps().
setkey(setDT(df2), StartTime, EndTime)
foo <- foverlaps(dummy, setDT(df2), by.x = c("StartTime", "EndTime"))

Now, it is time to clean up the data. For each NumberInSequence, if there are more than 1 overlapping intervals (n > 1), remove a row, which has identical start and ending time (StartTime == i.StartTime & EndTime == i.EndTime). Then, remove duplicated rows for each NumberInSequence. If you have just one row indicating overlaping with another interval, that is enough, right? Finally, if StartTime == i.StartTime & EndTime == i.EndTime is TRUE, that means there is no other interval overlaping with the interval. So, you say OK. Otherwise, NOT OK. If necessary, remove extra columns later.

foo[,.SD[!(StartTime == i.StartTime & EndTime == i.EndTime & .N > 1)],
        by = c("ID","NumberInSequence")][!duplicated(NumberInSequence)][,
            check := ifelse(StartTime == i.StartTime & EndTime == i.EndTime,
                            "OK", "NOT OK")] -> out     
print(out)

#   ID NumberInSequence           StartTime             EndTime       NextStartTime  WaitTime i.ID i.NumberInSequence
#1:  1                1 2016-01-15 18:02:11 2016-01-15 18:02:17 2016-01-15 18:10:33  496 secs    1                  1
#2:  1                2 2016-01-15 18:10:33 2016-01-15 18:10:39 2016-01-15 18:25:08  869 secs    1                  2
#3:  1                5 2016-01-15 18:21:03 2016-01-15 19:53:17 2016-01-15 19:55:09  112 secs    1                  3
#4:  1                3 2016-01-15 18:25:08 2016-01-15 18:25:14 2016-01-15 18:33:56  522 secs    1                  5
#5:  1                4 2016-01-15 18:33:56 2016-01-15 18:34:02 2016-01-15 18:21:03 -779 secs    1                  5
#6:  1                6 2016-01-15 19:55:09 2016-01-15 19:56:15 2016-01-15 19:57:03   48 secs    1                  6
#7:  1                7 2016-01-15 19:57:03 2016-01-15 19:58:17                <NA>   NA secs    1                  7

#           i.StartTime           i.EndTime  check
#1: 2016-01-15 18:02:11 2016-01-15 18:02:17     OK
#2: 2016-01-15 18:10:33 2016-01-15 18:10:39     OK
#3: 2016-01-15 18:25:08 2016-01-15 18:25:14 NOT OK
#4: 2016-01-15 18:21:03 2016-01-15 19:53:17 NOT OK
#5: 2016-01-15 18:21:03 2016-01-15 19:53:17 NOT OK
#6: 2016-01-15 19:55:09 2016-01-15 19:56:15     OK
#7: 2016-01-15 19:57:03 2016-01-15 19:58:17     OK

Upvotes: 2

Related Questions