How to get the difference of a lagged variable by date?

Question

Consider the following example:

library(tidyverse)
library(lubridate)

df = tibble(client_id = rep(1:3, each=24),
            date = rep(seq(ymd("2016-01-01"), (ymd("2016-12-01") + years(1)), by='month'), 3),
            expenditure = runif(72))

df stores monthly expenditure for three clients over the past 2 years (2016 and 2017). Now I want to calculate, for each client and each month, the difference between one year and the previous year.

Is there any way of doing this while keeping the "long" format of the dataset? Here is how I currently do it, which requires going wide:

df2 = df %>% 
  mutate(date2 = paste0('val_',
                        year(date), 
                        formatC(month(date), width=2, flag="0"))) %>% 
  select(client_id, date2, expenditure) %>% 
  pivot_wider(names_from = date2, 
              values_from = expenditure)

df3 = (df2[,2:13] - df2[,14:25])

However, I find this unnecessarily complex, and in large datasets going from long to wide can take quite a lot of time, so I think there must be a better way of doing it.

Ronak Shah · Accepted Answer

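If you want to keep the data in long format, one way would be to group by client_id and by the month-day part of the date, so that each group holds the two yearly values for the same month, and then calculate the difference using diff. Since diff returns the later value minus the earlier one, negating it gives the first year minus the second, which matches df3 in the question.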

library(dplyr)

df %>% 
  group_by(client_id, month_date = format(date, "%m-%d")) %>%
  summarise(diff = -diff(expenditure))

#   client_id month_date  diff
#       <int> <chr>        <dbl>
# 1         1 01-01       0.278  
# 2         1 02-01      -0.0421 
# 3         1 03-01       0.0117 
# 4         1 04-01      -0.0440 
# 5         1 05-01       0.855  
# 6         1 06-01       0.354  
# 7         1 07-01      -0.226  
# 8         1 08-01       0.506  
# 9         1 09-01       0.119  
#10         1 10-01       0.00819
# … with 26 more rows
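
As a variation on the same idea, here is a minimal sketch that keeps every original row instead of collapsing to one row per client and month. It uses lag() inside mutate(), so it should also extend to more than two years of data. It assumes each client has at most one row per calendar month per year; the column name yoy_diff and the set.seed() call are made up for illustration and are not part of the original question. Note that the sign is the opposite of the answer above: this computes the current year minus the previous year.

```r
library(dplyr)
library(lubridate)

set.seed(1)  # assumption: fixed seed, only so the example is reproducible

# Same shape as the data in the question: 3 clients x 24 months
df <- tibble(
  client_id   = rep(1:3, each = 24),
  date        = rep(seq(ymd("2016-01-01"), ymd("2017-12-01"), by = "month"), 3),
  expenditure = runif(72)
)

yoy <- df %>%
  # one group per client and calendar month (Jan 2016 sits with Jan 2017)
  group_by(client_id, month = month(date)) %>%
  # make sure the earlier year comes first within each group
  arrange(date, .by_group = TRUE) %>%
  # current value minus the same month one year earlier; NA in the first year
  mutate(yoy_diff = expenditure - lag(expenditure)) %>%
  ungroup() %>%
  select(-month) %>%
  arrange(client_id, date)

# Rows for 2016 have yoy_diff = NA; rows for 2017 hold the 2017 - 2016 change
head(filter(yoy, year(date) == 2017))
```

If only the differences are needed, filter(!is.na(yoy_diff)) drops the first-year rows and leaves one row per client and month, comparable to the summarise() result above apart from the sign.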
