Create a lag/laged variable based on a date rather than row number with dplyr in R

Question

Supouse one has a panel data similar to the following, (e.x.) with two variables "dates" and "sales":

  set.seed(123)
  df <- data.frame(
  "dates" = c(as.Date("2021-01-01")+0:30,
              as.Date("2022-01-01")+0:30),
  "sales" = 0:61+(sample(1:30,62,replace = T)))

Notice:

It is daily data from just January of 2021 and 2022, therefore there is a time gap since data form in between dates is not available.
You want to create (for example) two lagged variables:
- I) one day lag: the "sales" from the previous date
- II) one year lag: the "sales" from the same day one year before
You want to do this in "one-shot", I meant, in a line of code using pipes ("%>%")
The solution must be a generalized one: For example, if one wants later to create a "one month lag" or a "45 days lag", etc...

My attempts so far:

1st Try:

I've been trying to use dplyr's function "lag()", but the problem is that it creates a lag but based on the row number rather than the date in "dates". So it works well but when having time continuos datasets, but not with discontinuous ones:

df <- df %>% 
  mutate(lag_1day = lag(sales,order_by = dates,1))

See that despite that all continuos dates are fine, when it reaches the discontinuity, it assigns the last day by row and not by date, therefore assigning a value instead of a "NA"

> df[30:34,]
        dates sales lag_1day
30 2021-01-30    52       58
31 2021-01-31    34       52
32 2022-01-01    55       34
33 2022-01-02    60       55
34 2022-01-03    36       60

2nd Try:

Another way I've been using is by creating a copy of the first data frame, then adding to "dates" the required amount of days for the lag, and finally merging both data frames. It works perfectly (the outcome is the desired result), but I think is not efficient since is not a "one-shot" solution, and it's creating another data frame.

df <- merge(
  df,(df %>% 
        mutate(dates = dates+1) %>% 
        rename(lag_1day = sales)), 
  by = "dates", all.x = T)


> df[30:34,]
        dates sales lag_1day
30 2021-01-30    59       37
31 2021-01-31    49       59
32 2022-01-01    35       NA
33 2022-01-02    46       35
34 2022-01-03    50       46

Create a lag/laged variable based on a date rather than row number with dplyr in R

Answers (1)

Related Questions