Reputation: 4692
I have a data frame containing two columns: a grouping variable and a interval period over which the grouping variable holds. I have another data frame with a date column and a value column. How can I join these two tables together somewhat efficiently with dplyr+tidyverse functions?
library(dplyr)
library(lubridate)
ty <- data_frame(date = mdy(paste(1, 1 + seq(20), 2017, sep = "/")),
y = c(rnorm(7), rnorm(7, mean = 2), rnorm(6, mean = -1)))
gy <- data_frame(period = interval(mdy(c("01/01/2017", "01/08/2017", "01/15/2017")),
mdy(c("01/07/2017", "01/14/2017", "01/20/2017"))),
batch = c(1, 2, 3))
I want to build the table that is equivalent to:
ty %>% mutate(batch = c(rep(1, 7), rep(2, 7), rep(3, 6)))
Ideally, this should work reasonably quickly on data sets of up to 1,000,000 rows. Better still if it works on 100,000,000 :).
Upvotes: 1
Views: 1350
Reputation: 1422
How about:
ty %>%
mutate(batch = case_when(
ty$date %within% gy$period[1] ~gy$batch[1],
ty$date %within% gy$period[2] ~gy$batch[2],
ty$date %within% gy$period[3] ~gy$batch[3]))
You would obviously need to define the case_when intervals. How many have you got? I've used cat
and paste0
with good effect for that in the past.
Edited to reflect OP's comments. This should take care of the NSE and would allow the generation of the case_when intervals programatically:
ty %>%
mutate(batch = eval(parse(text = paste0("case_when(",
paste(
paste0(
"ty$date %within% gy$period[",
seq_along(gy$period),
"] ~gy$batch[",
seq_along(gy$period),
"]"
),
collapse = ", "
), ")"))))
Upvotes: 1
Reputation: 4692
This is the best I could come up with so far:
ty$batch <- unlist(lapply(ty$date, function(d) gy$batch[which(d %within% gy$period)]), recursive = FALSE, use.names = FALSE)
But it doesn't seem very fast.
Upvotes: 0