Generating daily rows in a dataframe for Ids with different start dates

Question

I'm working with a dataset for which I need to generate daily rows for each Id.

Currently, I'm doing something like this:

library(tidyverse)

# Minimal dataset
my_data <- tibble(
  id = 1:5,
  start_date = lubridate::date(c("2018-01-03", "2018-4-12", "2018-12-20", 
                                 "2019-03-31", "2019-12-01")))

# A dataframe with dates starting with the minimum 'start_date' from my data 
# until a predefined calculation date
calc_date <- lubridate::date("2020-01-31")
df_temp <- tibble(d = seq(min(my_data$start_date), calc_date, by="1 day"))

# Here I create the dataframe I need:
# For each id in 'my_data', I need a row for each day starting at 'start_date'
df <- my_data %>% 
  mutate(.dummy = 1L) %>% 
  full_join(df_temp %>% mutate(.dummy = 1L), by = '.dummy') %>% 
  select(-.dummy) %>% 
  filter(d >= start_date)

Although this works, I have the feeling that it's super inefficient, because of the full join: In this example, the length of the 'df_temp' dataframe is 759 rows, so first I will create 3795 rows and then I'll perform a filter that will keep only 2196 rows. I know this is a "toy" example, but I'll need to perform this task for a dataframe containing more than 5,000 Ids, so things might become very big.

Is there a way to create only the needed rows, instead of performing this full join? I'd like to use a "dplyr-ish" solution, if possible.

akrun · Accepted Answer

One option is non-equi join with data.table

library(data.table)
setDT(df_temp)
setDT(my_data)[df_temp, .(id, d = d, start_date), 
        on = .(start_date <= d), allow.cartesian = TRUE]

Generating daily rows in a dataframe for Ids with different start dates

Answers (1)

Related Questions