Conditional aggregation based on groups in a data frame R

Question

Let

Data_Frame <- data.frame(Col1 = c("A1", "A1", "A1", "A2", "A2", "A2", "A3", "A3", "A3"),
    
                         Col2 = c("2011-03-11", "2014-08-21", "2016-01-17", "2017-06-30", "2018-07-11", "2018-11-28", "2019-09-04", "2020-02-29", "2020-07-12"),
                  
                         Col3 = c("2018-10-22", "2019-05-24", "2020-12-25", "2018-10-12", "2019-09-24", "2020-12-19", "2018-10-22", "2019-06-14", "2020-12-20"),
              
                         Col4 = c(4, 2, 2, 1, 4, 4, 4, 4, 4),
             
                         Col5 = c(7, 6, 3, 1, 3, 2, 5, 1, 2))

Data_Frame$Col2 <- as.Date(Data_Frame$Col2)
Data_Frame$Col3 <- as.Date(Data_Frame$Col3)
Data_Frame$Col1 <- as.factor(Data_Frame$Col1)

Data_Frame <- Data_Frame %>% group_by(Col1) %>% mutate(Col6 = lubridate::time_length(lubridate::interval(Col2, max(Col3)), "years"))

Data_Frame <- Data_Frame %>% group_by(Col1) %>% dplyr::mutate(Col7 = ifelse(Col6 <= 1, 1, ifelse(Col6 >1 & Col6 <=2, 2, ifelse(Col6 >2 & Col6 <=5, 5, ifelse(Col6 >5 & Col6 <=10, 10, 11)))))

Data_Frame <- as.data.frame(Data_Frame)

be the dataframe where Col6 represents the time difference between Col2 and Col3 with elements of Col2 subtracted from the biggest date element from Col3 in each group A1 to A3 in Col1, and Col7 represents which elements in Col6 is <=1, <=2, <=5 and <=10.

There are problems in additional columns generated with different conditions.

Generation of Last1Col7 to Last10Col7:

New columns Last1Col7 to Last10Col7 are created based on Col7 and groups A1 to A3 in Col7 such that

Last1Col7 represents how many elements (number of rows) in Col7 are <=1 in each group,
Last2Col7 corresponds to number of rows <=2 in each group,
Last5Col7 corresponds to number of rows <=5 in each group and so on.

The expected result is:
The following code:

Data_Frame1 <- Data_Frame %>% group_by(Col1) %>% dplyr::mutate(Last1Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 1, ]),
                                                               
                                                               Last2Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 2, ]),
                                                               
                                                               Last5Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 5, ]),
                                                               
                                                               Last10Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 10, ]))

leads to a totally different result:

Generation of Last1SumCol4Col7 to Last10SumCol4Col7:
- Last1SumCol4Col7 is the sum of entries in Col4 corresponding to how many entries (number of rows) in Col7 are <=1 in each group A1 to A3 in Col1,
- Last2SumCol4Col7 is the sum of entries in Col4 corresponding to how many entries (number of rows) in Col7 are <=2 in each group A1 to A3 in Col1,
- Last5SumCol4Col7 is the sum of entries in Col4 corresponding to how many entries (number of rows) in Col7 are <=5 in each group A1 to A3 in Col1,
- Last10SumCol4Col7 is the sum of entries in Col4 corresponding to how many entries (number of rows) in Col7 are <=10 in each group A1 to A3 in Col1

The expected result is:

Using the following code:

Data_Frame1 <- Data_Frame %>% group_by(Col1) %>% dplyr::mutate(Last1SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=1, ]$Col4),
                                                              
                                                              Last2SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=2, ]$Col4),
                                                              
                                                              Last5SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=5, ]$Col4),
                                                              
                                                              Last10SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=10, ]$Col4))

the result is:

Starting with columns where all the initial entries of Last1Col7 to Last10Col7 and Last1SumCol4Col7 to Last10SumCol4Col7 are zeros and then using the codes above doesn't help either. What is it that is fundamentally going wrong in the codes under 1 and 3?

akrun · Accepted Answer

We could use map to loop over the values used in comparison, then grouped by 'Col1', create the two columns within each loop, by taking the sum of 'Col7' which is less than or equal to the value looped, and the sum of corresponding values of 'Col4', where 'Col7' is less than or equal to the value

library(purrr)
library(dplyr)
map_dfc(c(1, 2, 5, 10), ~ Data_Frame %>%
     group_by(Col1) %>% 
     transmute(!! sprintf("Last%dCol7", .x) := sum(Col7 <= .x),
               !! sprintf("Last%dSumCol4Col7", .x) := sum(Col4[Col7<= .x])) %>% 
     ungroup %>%
     select(-Col1)) %>% 
 bind_cols(Data_Frame, .)

-output

#Col1       Col2       Col3 Col4 Col5      Col6 Col7 Last1Col7 Last1SumCol4Col7 Last2Col7 Last2SumCol4Col7 Last5Col7 Last5SumCol4Col7 Last10Col7
#1   A1 2011-03-11 2018-10-22    4    7 9.7917808   10         0                0         0                0         1                2          3
#2   A1 2014-08-21 2019-05-24    2    6 6.3452055   10         0                0         0                0         1                2          3
#3   A1 2016-01-17 2020-12-25    2    3 4.9371585    5         0                0         0                0         1                2          3
#4   A2 2017-06-30 2018-10-12    1    1 3.4712329    5         0                0         0                0         3                9          3
#5   A2 2018-07-11 2019-09-24    4    3 2.4410959    5         0                0         0                0         3                9          3
#6   A2 2018-11-28 2020-12-19    4    2 2.0575342    5         0                0         0                0         3                9          3
#7   A3 2019-09-04 2018-10-22    4    5 1.2931507    2         2                8         3               12         3               12          3
#8   A3 2020-02-29 2019-06-14    4    1 0.8060109    1         2                8         3               12         3               12          3
#9   A3 2020-07-12 2020-12-20    4    2 0.4410959    1         2                8         3               12         3               12          3
#  Last10SumCol4Col7
#1                 8
#2                 8
#3                 8
#4                 9
#5                 9
#6                 9
#7                12
#8                12
#9                12

The issue in OP's code giving wrong sum is because Data_Frame[Data_Frame$Col7 <=2, ] is breaking the group and is getting the whole column subset instead of those within the group. Within tidyverse, we don't need to Data_Frame$, if we need to specify the data, use . or cur_data(). Also, here we just need Col7 <=2

Conditional aggregation based on groups in a data frame R

Answers (2)

Related Questions