Reputation: 1120
I have the following table:
day_1 = as.Date('2020-03-02')
day_2 = as.Date('2021-03-02')
day_3 = as.Date('2021-02-02')
data_example <- data.frame(pay_date = c(rep(day_1,4),rep(day_2,3),rep(day_3,3)),
arrears = c(0,0:2,rep(0:2,2)),
to_pay = c(2:5,2:4,4:6),
paid = c(2:5,2,3,0,4,0,0))
The data is:
> data_example
pay_date arrears to_pay paid
1 2020-03-02 0 2 2
2 2020-03-02 0 3 3
3 2020-03-02 1 4 4
4 2020-03-02 2 5 5
5 2021-03-02 0 2 2
6 2021-03-02 1 3 3
7 2021-03-02 2 4 0
8 2021-02-02 0 4 4
9 2021-02-02 1 5 0
10 2021-02-02 2 6 0
I want to aggregate the data by pay_date and arrears, and accumulate the amounts to_pay (in col1) and paid (in col2). Then I want to create col 3, which is the cumulate sum of col 1
. For example, for '2020-03-02', for arrears 0, the cumulative sum is 5. For arrears 1, the cumulative sum is 5 (from arrears 0) + 4 (from arrears 1) = 9. For arrears 2, the cumulative sum is 5 + 4 + 5 (from arrears 2:
pay_date arrears col1 col2 col3
* <date> <dbl> <int> <dbl> <int>
1 2020-03-02 0 5 5 5
2 2020-03-02 1 4 4 9
3 2020-03-02 2 5 5 14
The next calculation is col 4, which is the cumulative sum of col 2.
Col 5 is the total sum of to_pay
by pay_date
.
The payment performance is col4/col5. This is the expected result:
pay_date arrears col1 col2 col3 col4 col5 `Payment performance`
* <date> <dbl> <int> <dbl> <int> <dbl> <int> <dbl>
1 2020-03-02 0 5 5 5 5 14 0.36
2 2020-03-02 1 4 4 9 9 14 0.64
3 2020-03-02 2 5 5 14 14 14 1.00
4 2021-02-02 0 4 4 4 4 15 0.27
5 2021-02-02 1 5 0 9 4 15 0.27
6 2021-02-02 2 6 0 15 4 15 0.27
7 2021-03-02 0 2 2 2 2 9 0.22
8 2021-03-02 1 3 3 5 5 9 0.56
9 2021-03-02 2 4 0 9 5 9 0.56
I tried this code but I failed to group the sums:
data_example %>%
group_by(pay_date, arrears) %>%
summarise(col1 = sum(to_pay),
col2 = sum(paid),
.groups = "drop") %>%
mutate(col3 = cumsum(col1),
col4 = cumsum(col2),
col5 = sum(col1),
`Payment performance` = round(col4/col5, digits = 2))
Please, can you help me? In this case my problem is cumsum, because it sums all the values in the column, but I want to sum it by pay_date
. Any help will be greatly appreciated.
Upvotes: 1
Views: 55
Reputation: 1101
Isn't it just removing .groups = "drop"
which gets your desired result
data_example %>%
group_by(pay_date, arrears) %>%
summarise(col1 = sum(to_pay),
col2 = sum(paid)
) %>%
mutate(col3 = cumsum(col1),
col4 = cumsum(col2),
col5 = sum(col1),
`Payment performance` = round(col4/col5, digits = 2))
#A tibble: 9 x 8
# Groups: pay_date [3]
pay_date arrears col1 col2 col3 col4 col5 `Payment performance`
<date> <dbl> <int> <dbl> <int> <dbl> <int> <dbl>
1 2020-03-02 0 5 5 5 5 14 0.36
2 2020-03-02 1 4 4 9 9 14 0.64
3 2020-03-02 2 5 5 14 14 14 1
4 2021-02-02 0 4 4 4 4 15 0.27
5 2021-02-02 1 5 0 9 4 15 0.27
6 2021-02-02 2 6 0 15 4 15 0.27
7 2021-03-02 0 2 2 2 2 9 0.22
8 2021-03-02 1 3 3 5 5 9 0.56
9 2021-03-02 2 4 0 9 5 9 0.56
Another solution using data.table
:
library(data.table)
setDT(data_example)
df1 <- data_example[, .(col1 = sum(to_pay),
col2 = sum(paid)),
.(pay_date, arrears)][order(pay_date)]
df1[, `:=`(col3 = cumsum(col1), col4 = cumsum(col2)), pay_date]
df1[, col5 := sum(col1), pay_date]
#Finally
df1[, `Payment performance` := round(col4 / col5, digits = 2)]
df1
#Same results
pay_date arrears col1 col2 col3 col4 col5 Payment performance
1: 2020-03-02 0 5 5 5 5 14 0.36
2: 2020-03-02 1 4 4 9 9 14 0.64
3: 2020-03-02 2 5 5 14 14 14 1.00
4: 2021-02-02 0 4 4 4 4 15 0.27
5: 2021-02-02 1 5 0 9 4 15 0.27
6: 2021-02-02 2 6 0 15 4 15 0.27
7: 2021-03-02 0 2 2 2 2 9 0.22
8: 2021-03-02 1 3 3 5 5 9 0.56
9: 2021-03-02 2 4 0 9 5 9 0.56
Upvotes: 3
Reputation: 8880
add one more level group_by(tmp = data.table::rleid(pay_date))
after summarise
data_example %>%
group_by(pay_date, arrears) %>%
summarise(col1 = sum(to_pay),
col2 = sum(paid),
.groups = "drop") %>%
group_by(tmp = data.table::rleid(pay_date)) %>%
mutate(col3 = cumsum(col1),
col4 = cumsum(col2),
col5 = sum(col1),
`Payment performance` = round(col4/col5, digits = 2))
# A tibble: 9 x 9
# Groups: tmp [3]
pay_date arrears col1 col2 tmp col3 col4 col5 `Payment performance`
<date> <dbl> <int> <dbl> <int> <int> <dbl> <int> <dbl>
1 2020-03-02 0 5 5 1 5 5 14 0.36
2 2020-03-02 1 4 4 1 9 9 14 0.64
3 2020-03-02 2 5 5 1 14 14 14 1
4 2021-02-02 0 4 4 2 4 4 15 0.27
5 2021-02-02 1 5 0 2 9 4 15 0.27
6 2021-02-02 2 6 0 2 15 4 15 0.27
7 2021-03-02 0 2 2 3 2 2 9 0.22
8 2021-03-02 1 3 3 3 5 5 9 0.56
9 2021-03-02 2 4 0 3 9 5 9 0.56
Upvotes: 1