Sean
Sean

Reputation: 185

Group dataframe rows by creating a unique ID column based on the amount of time passed between entries and variable values

I'm trying to group the rows of my dataframe into "courses" when the same variables appear at regular date intervals. When there is a gap in time frequency or when one of variables change I would like to give it a new course ID.

To give an example, my data looks something like this:

         Date     Name   Item
1  2018-06-02    Johan  Apple
2  2018-07-05    Johan  Apple
3  2018-08-02    Johan  Apple
4  2019-04-15    Johan  Apple
5  2019-05-15    Johan  Apple
6  2019-05-30 Samantha Orange
7  2019-06-12 Samantha Orange
8  2019-06-27 Samantha Orange
9  2018-02-15     Mary  Lemon
10 2018-04-10     Mary  Lemon
11 2018-06-12     Mary  Lemon
12 2018-08-13     Mary   Lime
13 2018-08-27     Mary   Lime
14 2017-03-09   George   Kiwi

Each different combination of Name and Item should generate a new course ID.

However (the tricky part) if there is a significant time gap between two transactions where the other variables are constant, defined as: either more than 6 months or more than three times the average interval up to that date for that specific combination of Item and Name then it should be given a new CourseID

In my example:

Code to reproduce:

data.frame(Date = as.Date(c("2018-06-02", "2018-07-05", "2018-08-02", "2019-04-15", "2019-05-15", "2019-05-30", "2019-06-12", "2019-06-27", "2018-02-15", "2018-04-10", "2018-06-12", "2018-08-13", "2018-08-27", "2017-03-09")),
           Name = c(rep("Johan", 5), rep("Samantha", 3), rep("Mary", 5), "George"),
           Item = c(rep("Apple", 5), rep("Orange", 3), rep("Lemon", 3), rep("Lime",2), "Kiwi"))

I'd like to create an additional column which has a unique identifier for each course - i.e. using stringi or similar.

Ideally the output would look something like this:

         Date     Name   Item CourseID
1  2018-06-02    Johan  Apple      q3J
2  2018-07-05    Johan  Apple      q3J
3  2018-08-02    Johan  Apple      q3J
4  2019-04-15    Johan  Apple      f8j
5  2019-05-15    Johan  Apple      f8j
6  2019-05-30 Samantha Orange      p8U
7  2019-06-12 Samantha Orange      p8U
8  2019-06-27 Samantha Orange      p8U
9  2018-02-15     Mary  Lemon      wi9
10 2018-04-10     Mary  Lemon      wi9
11 2018-06-12     Mary  Lemon      wi9
12 2018-08-13     Mary   Lime      q8U
13 2018-08-27     Mary   Lime      q8U
14 2017-03-09   George   Kiwi      jJ0

I've tried going about this using max/min on the date varaible, however I'm stumped when it comes to identifying the break based on the previous purchasing pattern.

There may be a package I don't know which has something for this, however I've been trying with Tidyverse so far.

Upvotes: 0

Views: 479

Answers (1)

Jon Spring
Jon Spring

Reputation: 66415

Here's a dplyr approach that calculates the gap and rolling avg gap within each Name/Item group, then flags large gaps, and assigns a new group for each large gap or change in Name or Item.

df1 %>%
  group_by(Name,Item) %>%
  mutate(purch_num = row_number(),
         time_since_first = Date - first(Date),
         gap = Date - lag(Date, default = as.Date(-Inf)),
         avg_gap = time_since_first / (purch_num-1),
         new_grp_flag = gap > 180 | gap > 3*avg_gap) %>%
  ungroup() %>%
  mutate(group = cumsum(new_grp_flag))

Upvotes: 1

Related Questions