Reputation: 21047
I'm working with a dataset for which I need to generate daily rows for each Id.
Currently, I'm doing something like this:
library(tidyverse)
# Minimal dataset
my_data <- tibble(
id = 1:5,
start_date = lubridate::date(c("2018-01-03", "2018-4-12", "2018-12-20",
"2019-03-31", "2019-12-01")))
# A dataframe with dates starting with the minimum 'start_date' from my data
# until a predefined calculation date
calc_date <- lubridate::date("2020-01-31")
df_temp <- tibble(d = seq(min(my_data$start_date), calc_date, by="1 day"))
# Here I create the dataframe I need:
# For each id in 'my_data', I need a row for each day starting at 'start_date'
df <- my_data %>%
mutate(.dummy = 1L) %>%
full_join(df_temp %>% mutate(.dummy = 1L), by = '.dummy') %>%
select(-.dummy) %>%
filter(d >= start_date)
Although this works, I have the feeling that it's super inefficient, because of the full join: In this example, the length of the 'df_temp' dataframe is 759 rows, so first I will create 3795 rows and then I'll perform a filter that will keep only 2196 rows. I know this is a "toy" example, but I'll need to perform this task for a dataframe containing more than 5,000 Ids, so things might become very big.
Is there a way to create only the needed rows, instead of performing this full join? I'd like to use a "dplyr-ish" solution, if possible.
Upvotes: 1
Views: 59
Reputation: 887118
One option is non-equi join with data.table
library(data.table)
setDT(df_temp)
setDT(my_data)[df_temp, .(id, d = d, start_date),
on = .(start_date <= d), allow.cartesian = TRUE]
Upvotes: 2