Reputation: 351
Let
Data_Frame <- data.frame(Col1 = c("A1", "A1", "A1", "A2", "A2", "A2", "A3", "A3", "A3"),
Col2 = c("2011-03-11", "2014-08-21", "2016-01-17", "2017-06-30", "2018-07-11", "2018-11-28", "2019-09-04", "2020-02-29", "2020-07-12"),
Col3 = c("2018-10-22", "2019-05-24", "2020-12-25", "2018-10-12", "2019-09-24", "2020-12-19", "2018-10-22", "2019-06-14", "2020-12-20"),
Col4 = c(4, 2, 2, 1, 4, 4, 4, 4, 4),
Col5 = c(7, 6, 3, 1, 3, 2, 5, 1, 2))
Data_Frame$Col2 <- as.Date(Data_Frame$Col2)
Data_Frame$Col3 <- as.Date(Data_Frame$Col3)
Data_Frame$Col1 <- as.factor(Data_Frame$Col1)
Data_Frame <- Data_Frame %>% group_by(Col1) %>% mutate(Col6 = lubridate::time_length(lubridate::interval(Col2, max(Col3)), "years"))
Data_Frame <- Data_Frame %>% group_by(Col1) %>% dplyr::mutate(Col7 = ifelse(Col6 <= 1, 1, ifelse(Col6 >1 & Col6 <=2, 2, ifelse(Col6 >2 & Col6 <=5, 5, ifelse(Col6 >5 & Col6 <=10, 10, 11)))))
Data_Frame <- as.data.frame(Data_Frame)
be the dataframe where Col6 represents the time difference between Col2 and Col3 with elements of Col2 subtracted from the biggest date element from Col3 in each group A1 to A3 in Col1, and Col7 represents which elements in Col6 is <=1, <=2, <=5 and <=10.
There are problems in additional columns generated with different conditions.
New columns Last1Col7 to Last10Col7 are created based on Col7 and groups A1 to A3 in Col7 such that
The expected result is:
The following code:
Data_Frame1 <- Data_Frame %>% group_by(Col1) %>% dplyr::mutate(Last1Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 1, ]),
Last2Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 2, ]),
Last5Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 5, ]),
Last10Col7 = nrow(Data_Frame[Data_Frame$Col7 <= 10, ]))
leads to a totally different result:
Generation of Last1SumCol4Col7 to Last10SumCol4Col7:
Last1SumCol4Col7 is the sum of entries in Col4 corresponding to how many entries (number of rows) in Col7 are <=1 in each group A1 to A3 in Col1,
Last2SumCol4Col7 is the sum of entries in Col4 corresponding to how many entries (number of rows) in Col7 are <=2 in each group A1 to A3 in Col1,
Last5SumCol4Col7 is the sum of entries in Col4 corresponding to how many entries (number of rows) in Col7 are <=5 in each group A1 to A3 in Col1,
Last10SumCol4Col7 is the sum of entries in Col4 corresponding to how many entries (number of rows) in Col7 are <=10 in each group A1 to A3 in Col1
Using the following code:
Data_Frame1 <- Data_Frame %>% group_by(Col1) %>% dplyr::mutate(Last1SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=1, ]$Col4),
Last2SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=2, ]$Col4),
Last5SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=5, ]$Col4),
Last10SumCol4Col7 = sum(Data_Frame[Data_Frame$Col7 <=10, ]$Col4))
Starting with columns where all the initial entries of Last1Col7 to Last10Col7 and Last1SumCol4Col7 to Last10SumCol4Col7 are zeros and then using the codes above doesn't help either. What is it that is fundamentally going wrong in the codes under 1 and 3?
Upvotes: 4
Views: 245
Reputation: 886938
We could use map
to loop over the values used in comparison, then grouped by 'Col1', create the two columns within each loop, by taking the sum
of 'Col7' which is less than or equal to the value looped, and the sum
of corresponding values of 'Col4', where 'Col7' is less than or equal to the value
library(purrr)
library(dplyr)
map_dfc(c(1, 2, 5, 10), ~ Data_Frame %>%
group_by(Col1) %>%
transmute(!! sprintf("Last%dCol7", .x) := sum(Col7 <= .x),
!! sprintf("Last%dSumCol4Col7", .x) := sum(Col4[Col7<= .x])) %>%
ungroup %>%
select(-Col1)) %>%
bind_cols(Data_Frame, .)
-output
#Col1 Col2 Col3 Col4 Col5 Col6 Col7 Last1Col7 Last1SumCol4Col7 Last2Col7 Last2SumCol4Col7 Last5Col7 Last5SumCol4Col7 Last10Col7
#1 A1 2011-03-11 2018-10-22 4 7 9.7917808 10 0 0 0 0 1 2 3
#2 A1 2014-08-21 2019-05-24 2 6 6.3452055 10 0 0 0 0 1 2 3
#3 A1 2016-01-17 2020-12-25 2 3 4.9371585 5 0 0 0 0 1 2 3
#4 A2 2017-06-30 2018-10-12 1 1 3.4712329 5 0 0 0 0 3 9 3
#5 A2 2018-07-11 2019-09-24 4 3 2.4410959 5 0 0 0 0 3 9 3
#6 A2 2018-11-28 2020-12-19 4 2 2.0575342 5 0 0 0 0 3 9 3
#7 A3 2019-09-04 2018-10-22 4 5 1.2931507 2 2 8 3 12 3 12 3
#8 A3 2020-02-29 2019-06-14 4 1 0.8060109 1 2 8 3 12 3 12 3
#9 A3 2020-07-12 2020-12-20 4 2 0.4410959 1 2 8 3 12 3 12 3
# Last10SumCol4Col7
#1 8
#2 8
#3 8
#4 9
#5 9
#6 9
#7 12
#8 12
#9 12
The issue in OP's code giving wrong sum
is because Data_Frame[Data_Frame$Col7 <=2, ]
is breaking the group and is getting the whole column subset instead of those within the group. Within tidyverse
, we don't need to Data_Frame$
, if we need to specify the data, use .
or cur_data()
. Also, here we just need Col7 <=2
Upvotes: 1
Reputation: 12703
Use cut()
for getting column Col7
library(data.table)
setDT(df1)[, `:=` (Col2 = as.Date(Col2), Col3 = as.Date(Col3) )]
df1[, Col6 := lubridate::time_length(lubridate::interval(Col2, max(Col3)), "years"), by = Col1]
df1[, Col7 := as.integer(as.character(cut(Col6, breaks = c(0, 1,2,5,10), labels = c(1,2,5,10)))), by = Col1]
df1[, `:=` (Last1Col7 = 0, Last2Col7 = 0, Last5Col7 = 0, Last10Col7 = 0,
Last1SumCol4Col7 = 0, Last2SumCol4Col7 = 0, Last5SumCol4Col7 = 0, Last10SumCol4Col7 = 0) ]
df1[Col7 <= 1, `:=` (Last1Col7 = .N, Last1SumCol4Col7 = sum(Col4)), by = Col1]
df1[Col7 <= 2, `:=` (Last2Col7 = .N, Last2SumCol4Col7 = sum(Col4)), by = Col1]
df1[Col7 <= 5, `:=` (Last5Col7 = .N, Last5SumCol4Col7 = sum(Col4)), by = Col1]
df1[Col7 <= 10, `:=` (Last10Col7 = .N, Last10SumCol4Col7 = sum(Col4)), by = Col1]
Output:
df1
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Last1Col7 Last2Col7
1: A1 2011-03-11 2018-10-22 4 7 9.7917808 10 0 0
2: A1 2014-08-21 2019-05-24 2 6 6.3452055 10 0 0
3: A1 2016-01-17 2020-12-25 2 3 4.9371585 5 0 0
4: A2 2017-06-30 2018-10-12 1 1 3.4712329 5 0 0
5: A2 2018-07-11 2019-09-24 4 3 2.4410959 5 0 0
6: A2 2018-11-28 2020-12-19 4 2 2.0575342 5 0 0
7: A3 2019-09-04 2018-10-22 4 5 1.2931507 2 0 3
8: A3 2020-02-29 2019-06-14 4 1 0.8060109 1 2 3
9: A3 2020-07-12 2020-12-20 4 2 0.4410959 1 2 3
Last5Col7 Last10Col7 Last1SumCol4Col7 Last2SumCol4Col7 Last5SumCol4Col7
1: 0 3 0 0 0
2: 0 3 0 0 0
3: 1 3 0 0 2
4: 3 3 0 0 9
5: 3 3 0 0 9
6: 3 3 0 0 9
7: 3 3 0 12 12
8: 3 3 8 12 12
9: 3 3 8 12 12
Last10SumCol4Col7
1: 8
2: 8
3: 8
4: 9
5: 9
6: 9
7: 12
8: 12
9: 12
Data:
df1 <- data.frame(Col1 = c("A1", "A1", "A1", "A2", "A2", "A2", "A3", "A3", "A3"),
Col2 = c("2011-03-11", "2014-08-21", "2016-01-17", "2017-06-30", "2018-07-11", "2018-11-28", "2019-09-04", "2020-02-29", "2020-07-12"),
Col3 = c("2018-10-22", "2019-05-24", "2020-12-25", "2018-10-12", "2019-09-24", "2020-12-19", "2018-10-22", "2019-06-14", "2020-12-20"),
Col4 = c(4, 2, 2, 1, 4, 4, 4, 4, 4),
Col5 = c(7, 6, 3, 1, 3, 2, 5, 1, 2))
Upvotes: 1