Dfeld

Reputation: 197

Check if values are within a certain range of a separate dataframe

I have the following two dataframes:

df <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
             dates = as.POSIXct(c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00")))

events <- data.frame(id = c("AED","CFR","DRR","DRR","UN"),
                 start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00","2018-05-17 11:18:00","2018-05-17 13:10:00","2018-05-17 14:18:00")),
                 end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00","2018-05-17 12:01:00","2018-05-17 14:18:00",NA)))

By unique id, I want to compare each date in df against the respective date ranges listed in the events dataframe (each row of the events dataframe is considered its own time range), so that I get the following result:

result <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
                 dates = c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00"),
                 inRange = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
                 outsideRange = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE))

If an id from df is not in events, then return FALSE for both inRange and outsideRange; if a df date is greater than events$start but events$end is NA, then inRange should be TRUE.

I'm hoping to apply the solution to a much larger dataset of at least 500,000 rows.

Upvotes: 1

Views: 883

Answers (3)

Martin Morgan

Reputation: 46886

If events do not overlap, sort the start and end coordinates and use findInterval() to determine which dates fall in the odd-numbered intervals:

x = with(events, sort(c(start, end)))
df$inRange = findInterval(df$dates, x) %% 2 == 1
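To see why the odd-interval trick works, here is a minimal sketch with toy, non-overlapping numeric intervals (hypothetical data, not the question's):

```r
# Toy, non-overlapping intervals: [2, 4] and [7, 9]
start <- c(2, 7)
end <- c(4, 9)

# Sorted boundaries alternate: start, end, start, end, ...
x <- sort(c(start, end))   # 2 4 7 9

# A point lies inside an interval exactly when findInterval()
# places it in an odd-numbered gap between the boundaries
dates <- c(1, 3, 5, 8, 10)
inRange <- findInterval(dates, x) %% 2 == 1
inRange                    # FALSE TRUE FALSE TRUE FALSE
```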

If events do overlap, first create a single vector of all start and end times and work out their sort order:

times <- with(events, c(start, end))
o <- order(times)
times <- times[o]

Create an event vector that is 1 where a start occurs and -1 where an end occurs, placed in the same sorted order:

event <- rep(c(1, -1), each = nrow(events))[o]

Calculate the 'coverage', i.e., the number of events in effect at each boundary:

cvg <- cumsum(event)
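The coverage idea is easiest to see on a tiny hypothetical example of two overlapping numeric intervals:

```r
# Toy overlapping intervals (hypothetical data): [1, 5] and [2, 7]
start <- c(1, 2); end <- c(5, 7)

times <- c(start, end)
o <- order(times)
times <- times[o]                     # 1 2 5 7
event <- rep(c(1, -1), each = 2)[o]   # +1 +1 -1 -1
cvg <- cumsum(event)                  # 1 2 1 0

# coverage only returns to 0 at 7, so the two ranges merge into [1, 7]
merged <- times[(event == 1 & cvg == 1) | (event == -1 & cvg == 0)]
merged                                # 1 7
```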

Finally, keep only the boundaries of the merged ranges: the 'start' values where the event is a start and coverage is 1, and likewise the 'end' values where the event is an end and coverage drops back to 0

times[ (event == 1 & cvg == 1) | (event == -1 & cvg == 0) ]

and proceed as above.

Putting this together we have

reduce_int <- function(start, end) {
    x <- c(start, end)
    o <- order(x)
    x <- x[o]

    # 1 marks a start, -1 marks an end, placed in sorted order
    event <- rep(c(1, -1), each = length(start))[o]
    cvg <- cumsum(event)

    # starts where coverage rises to 1, ends where it drops back to 0
    x[ (event == 1 & cvg == 1) | (event == -1 & cvg == 0) ]
}

overlaps <- function(x, events) {
    vec <- reduce_int(events$start, events$end)
    findInterval(x, vec) %% 2 == 1
}

used as

df$inRange <- overlaps(df$dates, events)
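As a quick sanity check on the interval-merging step, here is a self-contained variant (renamed merge_intervals, on toy numeric intervals rather than the question's POSIXct data) that can be run on its own:

```r
# Self-contained variant of the interval-merging step above
merge_intervals <- function(start, end) {
    x <- c(start, end)
    o <- order(x)
    x <- x[o]
    event <- rep(c(1, -1), each = length(start))[o]
    cvg <- cumsum(event)
    x[(event == 1 & cvg == 1) | (event == -1 & cvg == 0)]
}

# [1, 5] and [2, 7] overlap and merge; [10, 12] stands alone
merged <- merge_intervals(c(1, 2, 10), c(5, 7, 12))
merged                                 # 1 7 10 12

# points 3 and 11 are covered, 8 is not
covered <- findInterval(c(3, 8, 11), merged) %% 2 == 1
covered                                # TRUE FALSE TRUE
```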

Upvotes: 1

MKR

Reputation: 20095

One option is a non-equi update join using data.table: join df and events on dates >= start and dates <= end, and set the inRange column to TRUE for matching records.

library(data.table)

setDT(df)
setDT(events)

df[events, on=c("dates>=start", "dates<=end"), inRange := TRUE]
df
#                  dates inRange
# 1: 2018-05-17 09:52:00      NA
# 2: 2018-05-17 09:56:00      NA
# 3: 2018-05-17 10:38:00    TRUE
# 4: 2018-05-17 11:29:00    TRUE
# 5: 2018-05-17 12:12:00      NA
# 6: 2018-05-17 13:20:00      NA
# 7: 2018-05-17 14:28:00    TRUE
# 8: 2018-05-17 15:59:00      NA

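The join above matches on dates alone; since the question asks for the comparison within each id, the id column can be added to the join condition. A minimal sketch with hypothetical toy data (dt, ev, numeric dates), also pre-filling FALSE so unmatched rows are not left NA:

```r
library(data.table)

# toy data (hypothetical), mirroring the question's shape with numeric dates
dt <- data.table(id = c("A", "A", "B"), dates = c(3, 9, 3))
ev <- data.table(id = "A", start = 2, end = 5)

dt[, inRange := FALSE]   # default, so non-matches are FALSE rather than NA
dt[ev, on = .(id, dates >= start, dates <= end), inRange := TRUE]
dt$inRange               # TRUE FALSE FALSE
```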
Upvotes: 1

moodymudskipper

Reputation: 47350

in base R:

df2 <- merge(df,events)
df2 <- within(df2, inRange <- dates > start & dates < end)
df2 <- aggregate(inRange ~ dates,df2,any)
#                 dates inRange
# 1 2018-05-17 09:52:00   FALSE
# 2 2018-05-17 09:56:00   FALSE
# 3 2018-05-17 10:38:00    TRUE
# 4 2018-05-17 11:29:00    TRUE
# 5 2018-05-17 12:12:00   FALSE
# 6 2018-05-17 13:20:00   FALSE
# 7 2018-05-17 14:28:00    TRUE
# 8 2018-05-17 15:59:00   FALSE

The first merge is effectively a cartesian product within each id here; if your data is big, we might be better off extracting the day on both sides first and then merging on it as well.

That means doing this before the above code:

df$year <- as.Date(df$dates)
events$year <- as.Date(events$start) # assuming start and end are always on same day
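A tiny illustration of why this helps (hypothetical frames df2/ev2): merge() uses all shared column names as keys, so rows from different days are never paired, shrinking the intermediate result:

```r
# merge() joins on every shared column name, so once both frames carry
# the extra date column, id and year together become the join keys
df2 <- data.frame(id = "A", year = as.Date("2018-05-17"), v = 1)
ev2 <- data.frame(id = "A", year = as.Date("2018-05-18"), w = 2)
n <- nrow(merge(df2, ev2))
n   # 0: same id but a different day, so no pairing
```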

Upvotes: 1
