emil_rore
emil_rore

Reputation: 115

Calculating grouped sequences in R with dplyr

I am working with a data set similar to the sample I created below, where each customer's activity is logged:

sample_data <- data.frame(customer_id = c(1000, 1000,1000, 1000,1000, 1000, 2000, 3000,3000,3000, 4000,4000),
           activity_date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01", "2020-02-29", "2020-03-01", "2020-01-02", "2020-01-01", "2020-03-04", "2020-05-22", "2020-03-05", "2020-06-01"), format = "%Y-%m-%d"),
           activity = c("like", "purchase", "like", "visit", "email", "like", "purchase", "visit", "purchase", "visit", "like", "email"))

For my final data set, I would like to add two columns with calculated "sequences" to the data, where each column indicates a different kind of sequence.

  1. General sequence: Grouped on a customer_id level, each rule should be counted consecutively. However, rules happening on the same date should also indicate the same sequence, meaning that counting only continues when the date changes.
  2. Rule sequence: Grouped on a customer_id level, each individual rule should start with the sequence 1 and continue counting on a rule-basis, depending on how often the specific rule appears per customer. Again, rules happening on the same date should indicate the same sequence.

I have come up with the following dplyr code so far, which has two issues:

test_result <- sample_data  %>%
  dplyr::group_by(customer_id) %>% 
  dplyr::arrange(activity_date) %>% 
  dplyr::mutate(general_sequence=1:n()) %>% dplyr::ungroup()

For more clarity, I created a table that shows how I would like the final result to look like:

final_data <- data.frame(customer_id = c(1000, 1000,1000, 1000,1000, 1000, 2000, 3000,3000,3000, 4000,4000),
                         activity_date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01", "2020-02-29", "2020-03-01", "2020-01-02", "2020-01-01", "2020-03-04", "2020-05-22", "2020-03-05", "2020-06-01"), format = "%Y-%m-%d"),
                         activity = c("like", "purchase", "like", "visit", "purchase", "like", "purchase", "visit", "purchase", "visit", "like", "email"),
                         general_sequence = c(1, 1, 1, 1, 2, 3, 1, 1, 2, 3, 1, 2),
                         rule_sequence = c(1, 1, 2, 1, 2, 3, 1, 1, 1, 2, 1, 1))

Any help is highly appreciated! Thanks!

Upvotes: 1

Views: 541

Answers (1)

TimTeaFan
TimTeaFan

Reputation: 18581

This would be one approach using {dplyr} and data.table::rleid:

library(dplyr)

sample_data %>% 
  group_by(customer_id) %>%
  mutate(general_sequence = data.table::rleid(activity_date)) %>% 
  group_by(customer_id, activity) %>%
  mutate(rule_sequence = row_number())

#> # A tibble: 12 x 5
#> # Groups:   customer_id, activity [9]
#>    customer_id activity_date activity general_sequence rule_sequence
#>          <dbl> <date>        <chr>               <int>         <int>
#>  1        1000 2020-01-01    like                    1             1
#>  2        1000 2020-01-01    purchase                1             1
#>  3        1000 2020-01-01    like                    1             2
#>  4        1000 2020-01-01    visit                   1             1
#>  5        1000 2020-02-29    email                   2             1
#>  6        1000 2020-03-01    like                    3             3
#>  7        2000 2020-01-02    purchase                1             1
#>  8        3000 2020-01-01    visit                   1             1
#>  9        3000 2020-03-04    purchase                2             1
#> 10        3000 2020-05-22    visit                   3             2
#> 11        4000 2020-03-05    like                    1             1
#> 12        4000 2020-06-01    email                   2             1

Created on 2021-05-11 by the reprex package (v0.3.0)

Upvotes: 1

Related Questions