Ray
Ray

Reputation: 351

Conditional aggregation based on groups in a data frame R

Let

Data_Frame <- data.frame(Col1 = c("A1", "A1", "A1", "A2", "A2", "A2", "A3", "A3", "A3"),
    
                         Col2 = c("2011-03-11", "2014-08-21", "2016-01-17", "2017-06-30", "2018-07-11", "2018-11-28", "2019-09-04", "2020-02-29", "2020-07-12"),
                  
                         Col3 = c("2018-10-22", "2019-05-24", "2020-12-25", "2018-10-12", "2019-09-24", "2020-12-19", "2018-10-22", "2019-06-14", "2020-12-20"),
              
                         Col4 = c(4, 2, 2, 1, 4, 4, 4, 4, 4),
             
                         Col5 = c(7, 6, 3, 1, 3, 2, 5, 1, 2))

Data_Frame$Col2 <- as.Date(Data_Frame$Col2)
Data_Frame$Col3 <- as.Date(Data_Frame$Col3)
Data_Frame$Col1 <- as.factor(Data_Frame$Col1)

Data_Frame <- Data_Frame %>% group_by(Col1) %>% mutate(Col6 = lubridate::time_length(lubridate::interval(Col2, max(Col3)), "years"))

Data_Frame <- Data_Frame %>% group_by(Col1) %>% dplyr::mutate(Col7 = ifelse(Col6 <= 1, 1, ifelse(Col6 >1 & Col6 <=2, 2, ifelse(Col6 >2 & Col6 <=5, 5, ifelse(Col6 >5 & Col6 <=10, 10, 11)))))

Data_Frame <- as.data.frame(Data_Frame)

be the dataframe where Col6 represents the time difference between Col2 and Col3 with elements of Col2 subtracted from the biggest date element from Col3 in each group A1 to A3 in Col1, and Col7 represents which elements in Col6 is <=1, <=2, <=5 and <=10.

There are problems in additional columns generated with different conditions.

  1. Generation of Last1Col7 to Last10Col7:

New columns Last1Col7 to Last10Col7 are created based on Col7 and groups A1 to A3 in Col7 such that

The expected result is: enter image description here
The following code:

Data_Frame1 <- Data_Frame %>% group_by(Col1) %>% dplyr::mutate(Last1Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 1, ]),
                                                               
                                                               Last2Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 2, ]),
                                                               
                                                               Last5Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 5, ]),
                                                               
                                                               Last10Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 10, ]))

leads to a totally different result: enter image description here

  1. Generation of Last1SumCol4Col7 to Last10SumCol4Col7:

    • Last1SumCol4Col7 is the sum of entries in Col4 corresponding to how many entries (number of rows) in Col7 are <=1 in each group A1 to A3 in Col1,

    • Last2SumCol4Col7 is the sum of entries in Col4 corresponding to how many entries (number of rows) in Col7 are <=2 in each group A1 to A3 in Col1,

    • Last5SumCol4Col7 is the sum of entries in Col4 corresponding to how many entries (number of rows) in Col7 are <=5 in each group A1 to A3 in Col1,

    • Last10SumCol4Col7 is the sum of entries in Col4 corresponding to how many entries (number of rows) in Col7 are <=10 in each group A1 to A3 in Col1

The expected result is: enter image description here

Using the following code:

Data_Frame1 <- Data_Frame %>% group_by(Col1) %>% dplyr::mutate(Last1SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=1, ]$Col4),
                                                              
                                                              Last2SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=2, ]$Col4),
                                                              
                                                              Last5SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=5, ]$Col4),
                                                              
                                                              Last10SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=10, ]$Col4))

the result is: enter image description here

Starting with columns where all the initial entries of Last1Col7 to Last10Col7 and Last1SumCol4Col7 to Last10SumCol4Col7 are zeros and then using the codes above doesn't help either. What is it that is fundamentally going wrong in the codes under 1 and 3?

Upvotes: 4

Views: 245

Answers (2)

akrun
akrun

Reputation: 886938

We could use map to loop over the values used in comparison, then grouped by 'Col1', create the two columns within each loop, by taking the sum of 'Col7' which is less than or equal to the value looped, and the sum of corresponding values of 'Col4', where 'Col7' is less than or equal to the value

library(purrr)
library(dplyr)
map_dfc(c(1, 2, 5, 10), ~ Data_Frame %>%
     group_by(Col1) %>% 
     transmute(!! sprintf("Last%dCol7", .x) := sum(Col7 <= .x),
               !! sprintf("Last%dSumCol4Col7", .x) := sum(Col4[Col7<= .x])) %>% 
     ungroup %>%
     select(-Col1)) %>% 
 bind_cols(Data_Frame, .)

-output

#Col1       Col2       Col3 Col4 Col5      Col6 Col7 Last1Col7 Last1SumCol4Col7 Last2Col7 Last2SumCol4Col7 Last5Col7 Last5SumCol4Col7 Last10Col7
#1   A1 2011-03-11 2018-10-22    4    7 9.7917808   10         0                0         0                0         1                2          3
#2   A1 2014-08-21 2019-05-24    2    6 6.3452055   10         0                0         0                0         1                2          3
#3   A1 2016-01-17 2020-12-25    2    3 4.9371585    5         0                0         0                0         1                2          3
#4   A2 2017-06-30 2018-10-12    1    1 3.4712329    5         0                0         0                0         3                9          3
#5   A2 2018-07-11 2019-09-24    4    3 2.4410959    5         0                0         0                0         3                9          3
#6   A2 2018-11-28 2020-12-19    4    2 2.0575342    5         0                0         0                0         3                9          3
#7   A3 2019-09-04 2018-10-22    4    5 1.2931507    2         2                8         3               12         3               12          3
#8   A3 2020-02-29 2019-06-14    4    1 0.8060109    1         2                8         3               12         3               12          3
#9   A3 2020-07-12 2020-12-20    4    2 0.4410959    1         2                8         3               12         3               12          3
#  Last10SumCol4Col7
#1                 8
#2                 8
#3                 8
#4                 9
#5                 9
#6                 9
#7                12
#8                12
#9                12

The issue in OP's code giving wrong sum is because Data_Frame[Data_Frame$Col7 <=2, ] is breaking the group and is getting the whole column subset instead of those within the group. Within tidyverse, we don't need to Data_Frame$, if we need to specify the data, use . or cur_data(). Also, here we just need Col7 <=2

Upvotes: 1

Sathish
Sathish

Reputation: 12703

Use cut() for getting column Col7

library(data.table)
setDT(df1)[, `:=` (Col2 = as.Date(Col2), Col3 = as.Date(Col3) )]
df1[, Col6 := lubridate::time_length(lubridate::interval(Col2, max(Col3)), "years"), by = Col1]
df1[, Col7 := as.integer(as.character(cut(Col6, breaks = c(0, 1,2,5,10), labels = c(1,2,5,10)))), by = Col1]

df1[, `:=` (Last1Col7 = 0, Last2Col7 = 0, Last5Col7 = 0, Last10Col7 = 0,
            Last1SumCol4Col7 = 0, Last2SumCol4Col7 = 0, Last5SumCol4Col7 = 0, Last10SumCol4Col7 = 0) ]

df1[Col7 <= 1, `:=` (Last1Col7 = .N, Last1SumCol4Col7 = sum(Col4)), by = Col1]
df1[Col7 <= 2, `:=` (Last2Col7 = .N, Last2SumCol4Col7 = sum(Col4)), by = Col1]
df1[Col7 <= 5, `:=` (Last5Col7 = .N, Last5SumCol4Col7 = sum(Col4)), by = Col1]
df1[Col7 <= 10, `:=` (Last10Col7 = .N, Last10SumCol4Col7 = sum(Col4)), by = Col1]

Output:

df1
   Col1       Col2       Col3 Col4 Col5      Col6 Col7 Last1Col7 Last2Col7
1:   A1 2011-03-11 2018-10-22    4    7 9.7917808   10         0         0
2:   A1 2014-08-21 2019-05-24    2    6 6.3452055   10         0         0
3:   A1 2016-01-17 2020-12-25    2    3 4.9371585    5         0         0
4:   A2 2017-06-30 2018-10-12    1    1 3.4712329    5         0         0
5:   A2 2018-07-11 2019-09-24    4    3 2.4410959    5         0         0
6:   A2 2018-11-28 2020-12-19    4    2 2.0575342    5         0         0
7:   A3 2019-09-04 2018-10-22    4    5 1.2931507    2         0         3
8:   A3 2020-02-29 2019-06-14    4    1 0.8060109    1         2         3
9:   A3 2020-07-12 2020-12-20    4    2 0.4410959    1         2         3
   Last5Col7 Last10Col7 Last1SumCol4Col7 Last2SumCol4Col7 Last5SumCol4Col7
1:         0          3                0                0                0
2:         0          3                0                0                0
3:         1          3                0                0                2
4:         3          3                0                0                9
5:         3          3                0                0                9
6:         3          3                0                0                9
7:         3          3                0               12               12
8:         3          3                8               12               12
9:         3          3                8               12               12
   Last10SumCol4Col7
1:                 8
2:                 8
3:                 8
4:                 9
5:                 9
6:                 9
7:                12
8:                12
9:                12

Data:

df1 <- data.frame(Col1 = c("A1", "A1", "A1", "A2", "A2", "A2", "A3", "A3", "A3"),
    
                         Col2 = c("2011-03-11", "2014-08-21", "2016-01-17", "2017-06-30", "2018-07-11", "2018-11-28", "2019-09-04", "2020-02-29", "2020-07-12"),
                  
                         Col3 = c("2018-10-22", "2019-05-24", "2020-12-25", "2018-10-12", "2019-09-24", "2020-12-19", "2018-10-22", "2019-06-14", "2020-12-20"),
              
                         Col4 = c(4, 2, 2, 1, 4, 4, 4, 4, 4),
             
                         Col5 = c(7, 6, 3, 1, 3, 2, 5, 1, 2))

Upvotes: 1

Related Questions