Calculating grouped sequences in R with dplyr

Question

I am working with a data set similar to the sample I created below, where each customer's activity is logged:

sample_data <- data.frame(
  customer_id = c(1000, 1000, 1000, 1000, 1000, 1000, 2000, 3000, 3000, 3000, 4000, 4000),
  activity_date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01", "2020-02-29", "2020-03-01", "2020-01-02", "2020-01-01", "2020-03-04", "2020-05-22", "2020-03-05", "2020-06-01"), format = "%Y-%m-%d"),
  activity = c("like", "purchase", "like", "visit", "email", "like", "purchase", "visit", "purchase", "visit", "like", "email")
)

For my final data set, I would like to add two columns with calculated "sequences" to the data, where each column indicates a different kind of sequence.

General sequence: Within each customer_id, the rules (the values in the activity column) should be numbered consecutively in date order. Rules that happen on the same date should share the same sequence number, so the count only advances when the date changes.
Rule sequence: Within each customer_id, each individual rule should start at 1 and be counted separately, according to how often that specific rule appears for the customer. Again, rules happening on the same date should indicate the same sequence.

I have come up with the following dplyr code so far, which has two issues:

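library(dplyr)  # provides %>% as well as the verbs used below

test_result <- sample_data %>%
  dplyr::group_by(customer_id) %>%
  dplyr::arrange(activity_date) %>%
  # 1:n() numbers every row in the group, so same-day rows get different values
  dplyr::mutate(general_sequence = 1:n()) %>%
  dplyr::ungroup()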

Rules tracked on the same date do not have the same sequence. As you can see in the test_result, the count starts with 1 and continues counting, even when rules were tracked on the same day.
I did not manage to calculate the column "Rule sequence" at all. I assume that I would need to apply a different grouping in order to get the result (maybe based on "rule"?)

For more clarity, I created a table that shows how I would like the final result to look:

final_data <- data.frame(
  customer_id = c(1000, 1000, 1000, 1000, 1000, 1000, 2000, 3000, 3000, 3000, 4000, 4000),
  activity_date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01", "2020-02-29", "2020-03-01", "2020-01-02", "2020-01-01", "2020-03-04", "2020-05-22", "2020-03-05", "2020-06-01"), format = "%Y-%m-%d"),
  activity = c("like", "purchase", "like", "visit", "purchase", "like", "purchase", "visit", "purchase", "visit", "like", "email"),
  general_sequence = c(1, 1, 1, 1, 2, 3, 1, 1, 2, 3, 1, 2),
  rule_sequence = c(1, 1, 2, 1, 2, 3, 1, 1, 1, 2, 1, 1)
)

Any help is highly appreciated! Thanks!
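
One possible approach, offered as a sketch rather than a definitive solution: dplyr::dense_rank() gives tied values the same rank without leaving gaps, which fits the "General sequence" description, and dplyr::row_number() inside a customer_id + activity grouping yields a running count per rule. The second part rests on an assumption: the expected table numbers the two same-day "like" rows of customer 1000 as 1 and 2, so the rule sequence is read here as a plain running count per activity. The name result is only illustrative.

```r
library(dplyr)

# Sample data copied from the question.
sample_data <- data.frame(
  customer_id = c(1000, 1000, 1000, 1000, 1000, 1000, 2000, 3000, 3000, 3000, 4000, 4000),
  activity_date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01",
                            "2020-02-29", "2020-03-01", "2020-01-02", "2020-01-01",
                            "2020-03-04", "2020-05-22", "2020-03-05", "2020-06-01")),
  activity = c("like", "purchase", "like", "visit", "email", "like",
               "purchase", "visit", "purchase", "visit", "like", "email")
)

result <- sample_data %>%
  # Order rows chronologically within each customer.
  arrange(customer_id, activity_date) %>%
  group_by(customer_id) %>%
  # dense_rank(): equal dates share a rank, and ranks have no gaps.
  mutate(general_sequence = dense_rank(activity_date)) %>%
  # Assumption: the rule sequence is a running count per customer and activity.
  group_by(customer_id, activity) %>%
  mutate(rule_sequence = row_number()) %>%
  ungroup()

print(result)
```

On this sample, general_sequence matches the expected column. rule_sequence differs from the expected table only in the fifth row, because sample_data has "email" there while final_data has "purchase".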
