Reputation: 61
I have an event table and a set of intervals. I want to associate a boolean with each event, that is set to T is the event is in the interval.
My inputs:
intervals:
begin end
2015-03-01 14:02:00 2015-03-01 14:09:00
2015-03-01 15:13:00 2015-03-01 15:54:00
2015-03-01 16:02:00 2015-03-01 19:09:00
events:
id date
1 2015-03-01 14:01:00
2 2015-03-01 14:03:00
3 2015-03-01 14:07:00
4 2015-03-01 15:55:00
5 2015-03-01 17:02:00
My output would be:
id inInterval
1 F
2 T
3 T
4 F
5 T
I have millions of events and tens of thousands of intervals. For performance reasons I don't want to loop on the events and I don't want to process all events for each interval.
What I can easily do is get on ordered data.table of events and intervals start/end:
events:
id date start end
1 2015-03-01 14:01:00
NA 2015-03-01 14:02:00 T
2 2015-03-01 14:03:00
3 2015-03-01 14:07:00
NA 2015-03-01 14:09:00 T
NA 2015-03-01 15:13:00 T
NA 2015-03-01 15:54:00 T
4 2015-03-01 15:55:00
NA 2015-03-01 16:02:00 T
5 2015-03-01 17:02:00
NA 2015-03-01 19:09:00 T
Now I am stuck on the last part: setting inInterval=T for all lines between a start=T and a end=T. It clearly requires some kind of lag, but I am missing the required feature.
Any help would be appreciated. Thanks
EDIT: It seems that this existing question provides some insights: What is an efficient method for partitioning and aggregating intervals from timestamped rows in a data frame? It was suggested once I published my question. Basically findInterval should help me solve my issue.
Upvotes: 2
Views: 278
Reputation: 2950
With ivs and iv_between()
:
library(ivs)
library(dplyr, warn.conflicts = FALSE)
intervals <- tibble(
begin = c(
"2015-03-01 14:02:00",
"2015-03-01 15:13:00",
"2015-03-01 16:02:00"
),
end = c(
"2015-03-01 14:09:00",
"2015-03-01 15:54:00",
"2015-03-01 19:09:00"
)
)
intervals <- intervals %>%
mutate(
begin = as.POSIXct(begin, tz = "UTC"),
end = as.POSIXct(end, tz = "UTC")
) %>%
mutate(
range = iv(begin, end),
.keep = "unused"
)
events <- tibble(
id = 1:5,
date = c(
"2015-03-01 14:01:00",
"2015-03-01 14:03:00",
"2015-03-01 14:07:00",
"2015-03-01 15:55:00",
"2015-03-01 17:02:00"
)
)
events <- events %>%
mutate(date = as.POSIXct(date, tz = "UTC"))
intervals
#> # A tibble: 3 × 1
#> range
#> <iv<dttm>>
#> 1 [2015-03-01 14:02:00, 2015-03-01 14:09:00)
#> 2 [2015-03-01 15:13:00, 2015-03-01 15:54:00)
#> 3 [2015-03-01 16:02:00, 2015-03-01 19:09:00)
events %>%
mutate(in_interval = iv_between(date, intervals$range))
#> # A tibble: 5 × 3
#> id date in_interval
#> <int> <dttm> <lgl>
#> 1 1 2015-03-01 14:01:00 FALSE
#> 2 2 2015-03-01 14:03:00 TRUE
#> 3 3 2015-03-01 14:07:00 TRUE
#> 4 4 2015-03-01 15:55:00 FALSE
#> 5 5 2015-03-01 17:02:00 TRUE
Upvotes: 0
Reputation: 15784
My newbie idea in base R (given the intervals table is ordered and that there's no overlapping intervals):
pint <- as.POSIXct(as.vector(t(intervals)))
results <- data.frame(
id = events$id,
inInterval = findInterval( as.POSIXct(events$date), pint) %% 2 != 0
)
Gives:
id inInterval
1 1 FALSE
2 2 TRUE
3 3 TRUE
4 4 FALSE
5 5 TRUE
findInterval
return the index in the intervals found:
Upvotes: 1