wdkrnls
wdkrnls

Reputation: 4692

How to perform a join based on intervals with dplyr?

I have a data frame containing two columns: a grouping variable and a interval period over which the grouping variable holds. I have another data frame with a date column and a value column. How can I join these two tables together somewhat efficiently with dplyr+tidyverse functions?

library(dplyr)
library(lubridate)
ty <- data_frame(date = mdy(paste(1, 1 + seq(20), 2017, sep = "/")), 
                 y = c(rnorm(7), rnorm(7, mean = 2), rnorm(6, mean = -1)))
gy <- data_frame(period = interval(mdy(c("01/01/2017", "01/08/2017", "01/15/2017")), 
                                   mdy(c("01/07/2017", "01/14/2017", "01/20/2017"))), 
                          batch = c(1, 2, 3))

I want to build the table that is equivalent to:

ty %>% mutate(batch = c(rep(1, 7), rep(2, 7), rep(3, 6)))

Ideally, this should work reasonably quickly on data sets of up to 1,000,000 rows. Better still if it works on 100,000,000 :).

Upvotes: 1

Views: 1350

Answers (2)

biomiha
biomiha

Reputation: 1422

How about:

ty %>% 
  mutate(batch = case_when(
  ty$date %within% gy$period[1] ~gy$batch[1],
  ty$date %within% gy$period[2] ~gy$batch[2],
  ty$date %within% gy$period[3] ~gy$batch[3]))

You would obviously need to define the case_when intervals. How many have you got? I've used cat and paste0 with good effect for that in the past.

Edited to reflect OP's comments. This should take care of the NSE and would allow the generation of the case_when intervals programatically:

ty %>%
  mutate(batch = eval(parse(text = paste0("case_when(",
                                      paste(
                                        paste0(
                                          "ty$date %within% gy$period[",
                                          seq_along(gy$period),
                                          "] ~gy$batch[",
                                          seq_along(gy$period),
                                          "]"
                                        ),
                                        collapse = ", "
                                      ), ")"))))

Upvotes: 1

wdkrnls
wdkrnls

Reputation: 4692

This is the best I could come up with so far:

ty$batch <- unlist(lapply(ty$date, function(d) gy$batch[which(d %within% gy$period)]), recursive = FALSE, use.names = FALSE)

But it doesn't seem very fast.

Upvotes: 0

Related Questions