vicky
vicky

Reputation: 415

R Studio - get the summation for the rolling month

I have data frame like below, and I want to get the summation(value) for each 4 rolling month.

Edit: In the output I have "2018-12". But it's not shown in the input. It's a typo, my actual data contain "2018-12".

enter image description here

I prefer to use dplyr:

group <- c("red","green","red","red","red","green","green","green","red","green","green","green")
Month <- c("2019-01","2019-02","2019-03","2019-03","2019-05","2019-07","2019-07","2019-08","2019-09","2019-10","2019-10","2019-10")
VALUE <- c(10,20,30,40,50,60,70,80,90,100,110,120)
d_f <- data.frame(group,Month,VALUE)

d_f %>%
  group_by(group) %>%
  summarise(value = sum(value))

Can anyone please help me with how to handle the 4 rolling month? Thanks a lot for your valuable time.

Upvotes: 0

Views: 114

Answers (2)

Ben
Ben

Reputation: 30504

Using lubridate you can use floor_date and group your dates by 4 month intervals.

library(tidyverse)
library(lubridate)

d_f %>%
  mutate(date = as.Date(paste0(Month, '-01'), format = "%Y-%m-%d")) %>%
  arrange(date) %>%
  group_by(group, startdategroup = floor_date(date, "4 months")) %>%
  summarise(value = sum(VALUE)) %>%
  mutate(enddategroup = startdategroup %m+% months(4) - 1)

Output

# A tibble: 6 x 4
# Groups:   group [2]
  group startdategroup value enddategroup
  <fct> <date>         <dbl> <date>      
1 green 2019-01-01        20 2019-04-30  
2 green 2019-05-01       210 2019-08-31  
3 green 2019-09-01       330 2019-12-31  
4 red   2019-01-01        80 2019-04-30  
5 red   2019-05-01        50 2019-08-31  
6 red   2019-09-01        90 2019-12-31 

Edit: To allow for an "overlap month" (months on the edge of two sequential date intervals), I might take a different approach.

First, I might create a sequence of start and end dates for the intervals (based on minimum and maximum dates in your data frame). The sequence would have date intervals every 4 months.

Then, I would do a fuzzy_left_join (using >= and <= logic) and merge this new data frame with yours. Then a row of data for a single month could be counted twice (once for each of two different intervals).

library(fuzzyjoin)

d_f$date = as.Date(paste0(Month, '-01'), format = "%Y-%m-%d")

d_f2 <- data.frame(date_start = seq.Date(min(d_f$date), max(d_f$date), "4 months"))
d_f2$date_end = date_start %m+% months(4)

d_f %>%
  fuzzy_left_join(d_f2, 
                  by = c("date" = "date_start", "date" = "date_end"), 
                  match_fun = list(`>=`, `<=`)) %>%
  group_by(group, date_start, date_end) %>%
  summarise(value = sum(VALUE))

Output

# A tibble: 6 x 4
# Groups:   group, date_start [6]
  group date_start date_end   value
  <fct> <date>     <date>     <dbl>
1 green 2019-01-01 2019-05-01    20
2 green 2019-05-01 2019-09-01   210
3 green 2019-09-01 2020-01-01   330
4 red   2019-01-01 2019-05-01   130
5 red   2019-05-01 2019-09-01   140
6 red   2019-09-01 2020-01-01    90

Upvotes: 1

Simon.S.A.
Simon.S.A.

Reputation: 6941

One approach is to use the lag/lead functions in dplyr. Something like:

df2 = df %>%
  group_by(group) %>%
  mutate(prev_value = lag(value, 1, order_by = month),
         prev_value2 = lag(value, 2, order_by = month),
         prev_value3 = lag(value, 3, order_by = month)) %>%
  mutate(avg = (value + prev_value + prev_value2 + prev_value3) / 4)

And then filter away the intervals you are not interested in.

Upvotes: 0

Related Questions