Reputation: 197
I have the following two dataframes:
df <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
dates = as.POSIXct(c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00")))
events <- data.frame(id = c("AED","CFR","DRR","DRR","UN"),
start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00","2018-05-17 11:18:00","2018-05-17 13:10:00","2018-05-17 14:18:00")),
end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00","2018-05-17 12:01:00","2018-05-17 14:18:00",NA)))
By unique id, I want to compare each date in df against the respective date ranges listed in the events dataframe (each row of the events dataframe is considered its own time range), so that I get the following result:
result <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
dates = c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00"),
inRange = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
outsideRange = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE))
If an id from df is not in events then return FALSE for both inRange and outsideRange; if a df date is greater then the event$start, but events$end is NA, then inRange should be TRUE
I'm hoping to apply the solution to a much larger dataset of at least 500,000 rows.
Upvotes: 1
Views: 883
Reputation: 46886
If events
do not overlap, then sort the start and end coordinates and use findInterval()
to determined those dates that are in the odd-numbered intervals
x = with(events, sort(c(start, end)))
df$inRange = findInterval(df$dates, x) %% 2 == 1
If events
do overlap, then create a vector of all events, figure out how to place them in order, and do so
times <- with(events, c(start, end))
o <- order(times)
times <- times[o]
create an event
vector that is 1
when a start occurs, -1
when an end occurs, and place these events in order
event <- rep(c(1, -1), each = nrow(events))[o]
calculate the 'coverage', the number of events that are currently in effect.
cvg <- cumsum(event)
Finally, create an updated events
data frame where starts and ends are derived from the 'start' values where coverage is 1 and the event is a 'start' event, and likewise for ends
times[ (event == 1 & cvg == 1) | (event == -1 & cvg == 0) ]
and proceed as above.
Putting this together we have
reduce_int <- function(start, end) {
x <- c(start, end)
o <- order(x)
x <- x[o]
event <- rep(c(1, -1), each = nrow(events))[o]
cvg <- cumsum(event)
x[ (event == 1 & cvg == 1) | (event == -1 & cvg == 0) ]
}
overlaps <- function(x, events) {
vec <- reduce_int(event$start, event$end)
findInterval(x, vec) %% 2 == 1
}
with use
df$inRange <- overlaps(df$dates, events)
Upvotes: 1
Reputation: 20095
One option is to use non-equi
update join using data.table
. Join df
and events
on dates>=start
and dates<=end
. Set the inRange
column as TRUE
for matching records.
library(data.table)
setDT(df)
setDT(events)
df[events, on=c("dates>=start", "dates<=end"), inRange := TRUE]
df
# dates inRange
# 1: 2018-05-17 09:52:00 NA
# 2: 2018-05-17 09:56:00 NA
# 3: 2018-05-17 10:38:00 TRUE
# 4: 2018-05-17 11:29:00 TRUE
# 5: 2018-05-17 12:12:00 NA
# 6: 2018-05-17 13:20:00 NA
# 7: 2018-05-17 14:28:00 TRUE
# 8: 2018-05-17 15:59:00 NA
#
Upvotes: 1
Reputation: 47350
in base R:
df2 <- merge(df,events)
df2 <- within(df2, inRange <- dates > start & dates < end)
df2 <- aggregate(inRange ~ dates,df2,any)
# dates inRange
# 1 2018-05-17 09:52:00 FALSE
# 2 2018-05-17 09:56:00 FALSE
# 3 2018-05-17 10:38:00 TRUE
# 4 2018-05-17 11:29:00 TRUE
# 5 2018-05-17 12:12:00 FALSE
# 6 2018-05-17 13:20:00 FALSE
# 7 2018-05-17 14:28:00 TRUE
# 8 2018-05-17 15:59:00 FALSE
The first merge is a cartesian product here, if your data is big we might be better off extracting the day first on both sides then merging.
That means doing this before the above code:
df$year <- as.Date(df$dates)
events$year <- as.Date(events$start) # assuming start and end are always on same day
Upvotes: 1