woshishui
woshishui

Reputation: 2084

R: dplyr group by date range

I'm trying to group a data frame by 3 date ranges based on "2016-04-10" and "2016-04-24".

df <- structure(list(date = structure(c(16803, 16810, 16817, 16824, 
16831, 16838, 16845, 16852, 16859, 16866, 16873, 16880, 16887, 
16894, 16901, 16908, 16915, 16922, 16929, 16936, 16943), class = "Date"), 
    new = c(1507L, 2851L, 3550L, 5329L, 7557L, 5546L, 6264L, 
    7160L, 9468L, 5789L, 5928L, 4642L, 8145L, 4867L, 4846L, 5231L, 
    7137L, 3938L, 3741L, 2937L, 194L), resolved = c(21, 27, 15, 
    16, 56, 2773, 8490, 8748, 9325, 7734, 10264, 6739, 6110, 
    9613, 10314, 10349, 7200, 9637, 10831, 11170, 5666), ost = c(1486, 
    2824, 3535, 5313, 7501, 2773, -2226, -1588, 143, -1945, -4336, 
    -2097, 2035, -4746, -5468, -5118, -63, -5699, -7090, -8233, 
    -5472)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-21L), .Names = c("date", "new", "resolved", "ost"))

Tried the following:

df1 <- df %>% group_by(dr=cut(date,breaks=as.Date(c("2016-04-10","2016-04-24")))) %>%
                summarise(ost = sum(ost))

Which gives the wrong result as below.

        dr    ost
2016-04-10 -10586
        NA -17885

Help appreciated!

Upvotes: 11

Views: 10594

Answers (2)

akrun
akrun

Reputation: 887851

We create a grouping variable 'dr' with cut. The breaks mentioned are the range of the 'date' i.e. the min and max values of the 'date' along with the dates specified by the OP, concatenate it (c), use the option include.lowest and get the sum of 'ost' based on this grouping variable.

df %>%
  group_by(dr = cut(date, breaks = c(range(date), 
            as.Date(c("2016-04-10", "2016-04-24"))), include.lowest=TRUE) ) %>% 
  summarise(ost =sum(ost))
#         dr    ost
#     <fctr>  <dbl>
#1 2016-01-03   8672
#2 2016-04-10 -10586
#3 2016-04-24 -26557

Or another option is findInterval which could be faster compared to cut.

df %>%
  group_by(dr = findInterval(date, as.Date(c("2016-04-10", "2016-04-24")))) %>% 
  summarise(ost = sum(ost))
#     dr    ost
#  <int>  <dbl>
#1     0   8672
#2     1 -10586
#3     2 -26557

NOTE: The OP asked the question about cut and this solution gives that.

Upvotes: 10

Sotos
Sotos

Reputation: 51592

You can create a grouping variable first,

df %>% 
mutate(group = cumsum(grepl('2016-04-10|2016-04-24', date))) %>%
group_by(group) %>% 
summarise(ost = sum(ost))

#Source: local data frame [3 x 2]

#  group    ost
#  (int)  (dbl)
#1     0   8672
#2     1 -10586
#3     2 -26557

Upvotes: 7

Related Questions