Manu
Manu

Reputation: 1120

cumulate sum grouped in R

I have the following table:

day_1 = as.Date('2020-03-02')
day_2 = as.Date('2021-03-02')
day_3 = as.Date('2021-02-02')
data_example <- data.frame(pay_date = c(rep(day_1,4),rep(day_2,3),rep(day_3,3)),
                           arrears = c(0,0:2,rep(0:2,2)),
                           to_pay = c(2:5,2:4,4:6),
                           paid = c(2:5,2,3,0,4,0,0))

The data is:

> data_example
     pay_date arrears to_pay paid
1  2020-03-02       0      2    2
2  2020-03-02       0      3    3
3  2020-03-02       1      4    4
4  2020-03-02       2      5    5
5  2021-03-02       0      2    2
6  2021-03-02       1      3    3
7  2021-03-02       2      4    0
8  2021-02-02       0      4    4
9  2021-02-02       1      5    0
10 2021-02-02       2      6    0

I want to aggregate the data by pay_date and arrears, and accumulate the amounts to_pay (in col1) and paid (in col2). Then I want to create col 3, which is the cumulate sum of col 1. For example, for '2020-03-02', for arrears 0, the cumulative sum is 5. For arrears 1, the cumulative sum is 5 (from arrears 0) + 4 (from arrears 1) = 9. For arrears 2, the cumulative sum is 5 + 4 + 5 (from arrears 2:

  pay_date   arrears  col1  col2  col3  
* <date>       <dbl> <int> <dbl> <int> 
1 2020-03-02       0     5     5     5  
2 2020-03-02       1     4     4     9 
3 2020-03-02       2     5     5    14 

The next calculation is col 4, which is the cumulative sum of col 2.

Col 5 is the total sum of to_pay by pay_date.

The payment performance is col4/col5. This is the expected result:

  pay_date   arrears  col1  col2  col3  col4  col5 `Payment performance`
* <date>       <dbl> <int> <dbl> <int> <dbl> <int>                 <dbl>
1 2020-03-02       0     5     5     5     5    14                  0.36
2 2020-03-02       1     4     4     9     9    14                  0.64
3 2020-03-02       2     5     5    14    14    14                  1.00
4 2021-02-02       0     4     4     4     4    15                  0.27
5 2021-02-02       1     5     0     9     4    15                  0.27
6 2021-02-02       2     6     0    15     4    15                  0.27
7 2021-03-02       0     2     2     2     2     9                  0.22
8 2021-03-02       1     3     3     5     5     9                  0.56
9 2021-03-02       2     4     0     9     5     9                  0.56

I tried this code but I failed to group the sums:

data_example %>% 
  group_by(pay_date, arrears) %>%
  summarise(col1 = sum(to_pay),
            col2 = sum(paid),
            .groups = "drop") %>%
  mutate(col3 = cumsum(col1),
         col4 = cumsum(col2),
         col5 = sum(col1),
         `Payment performance` = round(col4/col5, digits = 2))

Please, can you help me? In this case my problem is cumsum, because it sums all the values in the column, but I want to sum it by pay_date. Any help will be greatly appreciated.

Upvotes: 1

Views: 55

Answers (2)

Chriss Paul
Chriss Paul

Reputation: 1101

Isn't it just removing .groups = "drop" which gets your desired result

data_example %>% 
  group_by(pay_date, arrears) %>%
  summarise(col1 = sum(to_pay),
            col2 = sum(paid)
            ) %>%
  mutate(col3 = cumsum(col1),
         col4 = cumsum(col2),
         col5 = sum(col1),
         `Payment performance` = round(col4/col5, digits = 2))

 #A tibble: 9 x 8
# Groups:   pay_date [3]
  pay_date   arrears  col1  col2  col3  col4  col5 `Payment performance`
  <date>       <dbl> <int> <dbl> <int> <dbl> <int>                 <dbl>
1 2020-03-02       0     5     5     5     5    14                  0.36
2 2020-03-02       1     4     4     9     9    14                  0.64
3 2020-03-02       2     5     5    14    14    14                  1   
4 2021-02-02       0     4     4     4     4    15                  0.27
5 2021-02-02       1     5     0     9     4    15                  0.27
6 2021-02-02       2     6     0    15     4    15                  0.27
7 2021-03-02       0     2     2     2     2     9                  0.22
8 2021-03-02       1     3     3     5     5     9                  0.56
9 2021-03-02       2     4     0     9     5     9                  0.56

Another solution using data.table :

library(data.table)
setDT(data_example)
df1 <- data_example[, .(col1 = sum(to_pay),
                        col2 = sum(paid)),
                    .(pay_date, arrears)][order(pay_date)]

df1[, `:=`(col3 = cumsum(col1), col4 = cumsum(col2)), pay_date]
df1[, col5 := sum(col1), pay_date]

#Finally
df1[, `Payment performance` := round(col4 / col5, digits = 2)]
df1
#Same results
     pay_date arrears col1 col2 col3 col4 col5 Payment performance
1: 2020-03-02       0    5    5    5    5   14                0.36
2: 2020-03-02       1    4    4    9    9   14                0.64
3: 2020-03-02       2    5    5   14   14   14                1.00
4: 2021-02-02       0    4    4    4    4   15                0.27
5: 2021-02-02       1    5    0    9    4   15                0.27
6: 2021-02-02       2    6    0   15    4   15                0.27
7: 2021-03-02       0    2    2    2    2    9                0.22
8: 2021-03-02       1    3    3    5    5    9                0.56
9: 2021-03-02       2    4    0    9    5    9                0.56

Upvotes: 3

Yuriy Saraykin
Yuriy Saraykin

Reputation: 8880

add one more level group_by(tmp = data.table::rleid(pay_date)) after summarise

data_example %>% 
  group_by(pay_date, arrears) %>%
  summarise(col1 = sum(to_pay),
            col2 = sum(paid),
            .groups = "drop") %>%
  group_by(tmp = data.table::rleid(pay_date)) %>% 
  mutate(col3 = cumsum(col1),
         col4 = cumsum(col2),
         col5 = sum(col1),
         `Payment performance` = round(col4/col5, digits = 2))

# A tibble: 9 x 9
# Groups:   tmp [3]
  pay_date   arrears  col1  col2   tmp  col3  col4  col5 `Payment performance`
  <date>       <dbl> <int> <dbl> <int> <int> <dbl> <int>                 <dbl>
1 2020-03-02       0     5     5     1     5     5    14                  0.36
2 2020-03-02       1     4     4     1     9     9    14                  0.64
3 2020-03-02       2     5     5     1    14    14    14                  1   
4 2021-02-02       0     4     4     2     4     4    15                  0.27
5 2021-02-02       1     5     0     2     9     4    15                  0.27
6 2021-02-02       2     6     0     2    15     4    15                  0.27
7 2021-03-02       0     2     2     3     2     2     9                  0.22
8 2021-03-02       1     3     3     3     5     5     9                  0.56
9 2021-03-02       2     4     0     3     9     5     9                  0.56

Upvotes: 1

Related Questions