Reputation: 31
I have got a follow up question on my previous question about grouping rows on multiple conditions (Previous question).
I was wondering how I can group observations within 31 days of the first date. More importantly, after the 31 days are passed the next date within the same group will be the 'new' first date of that group. Furthermore, after each 'purchase' the grouping should also stop, and the next observation after the purchase will be the 'new' first day of that group.
Let me illustrate it with an example:
example <- structure(
list(
userID = c(1,1,1,1,1,1,2,2,2,2),
date = structure(
c(
18168, #2019-09-29
18189, #2019-10-20
18197, #2019-10-28
18205, #2019-11-05
18205, #2019-11-05
18217, #2019-11-17
18239, #2019-12-09
18270, #2020-01-09
18271, #2020-01-10
18275 #2020-01-14
),
class = "Date"
),
purchase = c(0,0, 0, 0, 0, 1, 0, 0, 1, 0)
),
row.names = c(NA, 10L),
class = "data.frame"
)
Desired outcome:
Outcome <- data.frame(
userID = c(1,1,2,2,2),
date.start = c("2019-09-29", "2019-11-05", "2019-12-09", "2020-01-10", "2020-01-14"),
date.end = c("2019-10-28", "2019-11-17", "2020-01-09", "2020-01-10", "2020-01-14"),
purchase = c(0, 1, 0, 1, 0)
)
Thanks in advance! :)
Upvotes: 0
Views: 372
Reputation: 21908
We could also use the following solution:
library(dplyr)
library(data.table)
example %>%
group_by(grp = cumsum(ifelse(lag(purchase, default = 0) == 1, 1, 0))) %>%
mutate(grp2 = cumsum(as.numeric(date - lag(date, default = first(date)))) > 30) %>%
ungroup() %>%
mutate(grp2 = data.table::rleid(grp2)) %>%
group_by(userID, grp, grp2) %>%
summarise(first = first(date), last = last(date), .groups = "drop") %>%
select(-grp)
# A tibble: 5 x 4
userID grp2 first last
<dbl> <int> <date> <date>
1 1 1 2019-09-29 2019-10-28
2 1 2 2019-11-05 2019-11-17
3 2 3 2019-12-09 2019-12-09
4 2 4 2020-01-09 2020-01-10
5 2 5 2020-01-14 2020-01-14
Upvotes: 1
Reputation: 26218
Like my answer on linked question, I again suggest accumulate
strategy here
library(tidyverse)
example
#> userID date purchase
#> 1 1 2019-09-29 0
#> 2 1 2019-10-20 0
#> 3 1 2019-10-28 0
#> 4 1 2019-11-05 0
#> 5 1 2019-11-05 0
#> 6 1 2019-11-17 1
#> 7 2 2019-12-09 0
#> 8 2 2020-01-09 0
#> 9 2 2020-01-10 1
#> 10 2 2020-01-14 0
example %>% group_by(userID) %>%
group_by(grp = unlist(accumulate2(date, purchase[-n()], ~ if(as.numeric(..2 - ..1) < 31 & ..3 != 1) ..1 else ..2)),
grp = with(rle(grp), rep(seq_along(lengths), lengths)), .add = T) %>%
summarise(start.date = first(date),
last.date = last(date), .groups = 'drop')
#> # A tibble: 5 x 4
#> userID grp start.date last.date
#> <dbl> <int> <date> <date>
#> 1 1 1 2019-09-29 2019-10-28
#> 2 1 2 2019-11-05 2019-11-17
#> 3 2 3 2019-12-09 2019-12-09
#> 4 2 4 2020-01-09 2020-01-10
#> 5 2 5 2020-01-14 2020-01-14
Created on 2021-06-13 by the reprex package (v2.0.0)
Upvotes: 1
Reputation: 6931
Because there are dependencies between when one time period ends and the next one starts (given a date, you can only tell if it is the start, middle, or end of a period after investigating every prior record) I can not see any better way of doing this than using a for loop.
Something like the following:
# create output column
example = example %>% mutate(grouping = NA)
# setup tracking variables
current_date = as.Date('1900-01-01')
current_id = -1
prev_purchase = 0
current_group = 0
for(ii in 1:nrow(example)){
# reset on new identity OR on puchase OR on 31 days elapsed
if(example$userID[ii] != current_id # new identity
|| prev_purchase == 1 # just had a purchase
|| example$date[ii] - current_date > 31){ # more than 31 days elapsed
current_date = example$date[ii]
current_id = example$userID[ii]
prev_purchase = example$purchase[ii]
current_group = current_group + 1
example$grouping[ii] = current_group
}
# otherwise step forwards
else {
prev_purchase = example$purchase[ii]
example$grouping[ii] = current_group
}
}
One advantage of this approach, is you can pause after the for loop and check whether the groupings are as expected. The groups can then be collapsed to the requested output using:
output = example %>%
group_by(userID, grouping) %>%
summarise(date.start = min(date),
date.end = max(date),
purchase = max(purchase)) %>%
select(-grouping)
Upvotes: 0