Reputation: 415
I have data frame like below, and I want to get the summation(value) for each 4 rolling month.
Edit: In the output I have "2018-12". But it's not shown in the input. It's a typo, my actual data contain "2018-12".
I prefer to use dplyr:
group <- c("red","green","red","red","red","green","green","green","red","green","green","green")
Month <- c("2019-01","2019-02","2019-03","2019-03","2019-05","2019-07","2019-07","2019-08","2019-09","2019-10","2019-10","2019-10")
VALUE <- c(10,20,30,40,50,60,70,80,90,100,110,120)
d_f <- data.frame(group,Month,VALUE)
d_f %>%
group_by(group) %>%
summarise(value = sum(value))
Can anyone please help me with how to handle the 4 rolling month? Thanks a lot for your valuable time.
Upvotes: 0
Views: 114
Reputation: 30504
Using lubridate
you can use floor_date
and group your dates by 4 month intervals.
library(tidyverse)
library(lubridate)
d_f %>%
mutate(date = as.Date(paste0(Month, '-01'), format = "%Y-%m-%d")) %>%
arrange(date) %>%
group_by(group, startdategroup = floor_date(date, "4 months")) %>%
summarise(value = sum(VALUE)) %>%
mutate(enddategroup = startdategroup %m+% months(4) - 1)
Output
# A tibble: 6 x 4
# Groups: group [2]
group startdategroup value enddategroup
<fct> <date> <dbl> <date>
1 green 2019-01-01 20 2019-04-30
2 green 2019-05-01 210 2019-08-31
3 green 2019-09-01 330 2019-12-31
4 red 2019-01-01 80 2019-04-30
5 red 2019-05-01 50 2019-08-31
6 red 2019-09-01 90 2019-12-31
Edit: To allow for an "overlap month" (months on the edge of two sequential date intervals), I might take a different approach.
First, I might create a sequence of start and end dates for the intervals (based on minimum and maximum dates in your data frame). The sequence would have date intervals every 4 months.
Then, I would do a fuzzy_left_join
(using >=
and <=
logic) and merge this new data frame with yours. Then a row of data for a single month could be counted twice (once for each of two different intervals).
library(fuzzyjoin)
d_f$date = as.Date(paste0(Month, '-01'), format = "%Y-%m-%d")
d_f2 <- data.frame(date_start = seq.Date(min(d_f$date), max(d_f$date), "4 months"))
d_f2$date_end = date_start %m+% months(4)
d_f %>%
fuzzy_left_join(d_f2,
by = c("date" = "date_start", "date" = "date_end"),
match_fun = list(`>=`, `<=`)) %>%
group_by(group, date_start, date_end) %>%
summarise(value = sum(VALUE))
Output
# A tibble: 6 x 4
# Groups: group, date_start [6]
group date_start date_end value
<fct> <date> <date> <dbl>
1 green 2019-01-01 2019-05-01 20
2 green 2019-05-01 2019-09-01 210
3 green 2019-09-01 2020-01-01 330
4 red 2019-01-01 2019-05-01 130
5 red 2019-05-01 2019-09-01 140
6 red 2019-09-01 2020-01-01 90
Upvotes: 1
Reputation: 6941
One approach is to use the lag
/lead
functions in dplyr
. Something like:
df2 = df %>%
group_by(group) %>%
mutate(prev_value = lag(value, 1, order_by = month),
prev_value2 = lag(value, 2, order_by = month),
prev_value3 = lag(value, 3, order_by = month)) %>%
mutate(avg = (value + prev_value + prev_value2 + prev_value3) / 4)
And then filter away the intervals you are not interested in.
Upvotes: 0