Reputation: 185
I'm trying to group the rows of my dataframe into "courses" when the same variables appear at regular date intervals. When there is a gap in time frequency or when one of variables change I would like to give it a new course ID.
To give an example, my data looks something like this:
Date Name Item
1 2018-06-02 Johan Apple
2 2018-07-05 Johan Apple
3 2018-08-02 Johan Apple
4 2019-04-15 Johan Apple
5 2019-05-15 Johan Apple
6 2019-05-30 Samantha Orange
7 2019-06-12 Samantha Orange
8 2019-06-27 Samantha Orange
9 2018-02-15 Mary Lemon
10 2018-04-10 Mary Lemon
11 2018-06-12 Mary Lemon
12 2018-08-13 Mary Lime
13 2018-08-27 Mary Lime
14 2017-03-09 George Kiwi
Each different combination of Name
and Item
should generate a new course ID.
However (the tricky part) if there is a significant time gap between two transactions where the other variables are constant, defined as: either more than 6 months or more than three times the average interval up to that date for that specific combination of Item
and Name
then it should be given a new CourseID
In my example:
CourseID
. Ideally the interval to check for future breaks would then be based on the average in this new group.CourseID
.CourseID
s.CourseID
Code to reproduce:
data.frame(Date = as.Date(c("2018-06-02", "2018-07-05", "2018-08-02", "2019-04-15", "2019-05-15", "2019-05-30", "2019-06-12", "2019-06-27", "2018-02-15", "2018-04-10", "2018-06-12", "2018-08-13", "2018-08-27", "2017-03-09")),
Name = c(rep("Johan", 5), rep("Samantha", 3), rep("Mary", 5), "George"),
Item = c(rep("Apple", 5), rep("Orange", 3), rep("Lemon", 3), rep("Lime",2), "Kiwi"))
I'd like to create an additional column which has a unique identifier for each course - i.e. using stringi or similar.
Ideally the output would look something like this:
Date Name Item CourseID
1 2018-06-02 Johan Apple q3J
2 2018-07-05 Johan Apple q3J
3 2018-08-02 Johan Apple q3J
4 2019-04-15 Johan Apple f8j
5 2019-05-15 Johan Apple f8j
6 2019-05-30 Samantha Orange p8U
7 2019-06-12 Samantha Orange p8U
8 2019-06-27 Samantha Orange p8U
9 2018-02-15 Mary Lemon wi9
10 2018-04-10 Mary Lemon wi9
11 2018-06-12 Mary Lemon wi9
12 2018-08-13 Mary Lime q8U
13 2018-08-27 Mary Lime q8U
14 2017-03-09 George Kiwi jJ0
I've tried going about this using max/min on the date varaible, however I'm stumped when it comes to identifying the break based on the previous purchasing pattern.
There may be a package I don't know which has something for this, however I've been trying with Tidyverse so far.
Upvotes: 0
Views: 479
Reputation: 66415
Here's a dplyr approach that calculates the gap and rolling avg gap within each Name/Item group, then flags large gaps, and assigns a new group for each large gap or change in Name or Item.
df1 %>%
group_by(Name,Item) %>%
mutate(purch_num = row_number(),
time_since_first = Date - first(Date),
gap = Date - lag(Date, default = as.Date(-Inf)),
avg_gap = time_since_first / (purch_num-1),
new_grp_flag = gap > 180 | gap > 3*avg_gap) %>%
ungroup() %>%
mutate(group = cumsum(new_grp_flag))
Upvotes: 1