Reputation: 115
I am working with a data set similar to the sample I created below, where each customer's activity is logged:
sample_data <- data.frame(customer_id = c(1000, 1000,1000, 1000,1000, 1000, 2000, 3000,3000,3000, 4000,4000),
activity_date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01", "2020-02-29", "2020-03-01", "2020-01-02", "2020-01-01", "2020-03-04", "2020-05-22", "2020-03-05", "2020-06-01"), format = "%Y-%m-%d"),
activity = c("like", "purchase", "like", "visit", "email", "like", "purchase", "visit", "purchase", "visit", "like", "email"))
For my final data set, I would like to add two columns with calculated "sequences" to the data, where each column indicates a different kind of sequence.
customer_id
level, each rule should be counted consecutively. However, rules happening on the same date should also indicate the same sequence, meaning that counting only continues when the date changes.customer_id
level, each individual rule should start with the sequence 1 and continue counting on a rule-basis, depending on how often the specific rule appears per customer. Again, rules happening on the same date should indicate the same sequence.I have come up with the following dplyr
code so far, which has two issues:
test_result <- sample_data %>%
dplyr::group_by(customer_id) %>%
dplyr::arrange(activity_date) %>%
dplyr::mutate(general_sequence=1:n()) %>% dplyr::ungroup()
test_result
, the count starts with 1 and continues counting, even when rules were tracked on the same day.For more clarity, I created a table that shows how I would like the final result to look like:
final_data <- data.frame(customer_id = c(1000, 1000,1000, 1000,1000, 1000, 2000, 3000,3000,3000, 4000,4000),
activity_date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01", "2020-02-29", "2020-03-01", "2020-01-02", "2020-01-01", "2020-03-04", "2020-05-22", "2020-03-05", "2020-06-01"), format = "%Y-%m-%d"),
activity = c("like", "purchase", "like", "visit", "purchase", "like", "purchase", "visit", "purchase", "visit", "like", "email"),
general_sequence = c(1, 1, 1, 1, 2, 3, 1, 1, 2, 3, 1, 2),
rule_sequence = c(1, 1, 2, 1, 2, 3, 1, 1, 1, 2, 1, 1))
Any help is highly appreciated! Thanks!
Upvotes: 1
Views: 541
Reputation: 18581
This would be one approach using {dplyr} and data.table::rleid
:
library(dplyr)
sample_data %>%
group_by(customer_id) %>%
mutate(general_sequence = data.table::rleid(activity_date)) %>%
group_by(customer_id, activity) %>%
mutate(rule_sequence = row_number())
#> # A tibble: 12 x 5
#> # Groups: customer_id, activity [9]
#> customer_id activity_date activity general_sequence rule_sequence
#> <dbl> <date> <chr> <int> <int>
#> 1 1000 2020-01-01 like 1 1
#> 2 1000 2020-01-01 purchase 1 1
#> 3 1000 2020-01-01 like 1 2
#> 4 1000 2020-01-01 visit 1 1
#> 5 1000 2020-02-29 email 2 1
#> 6 1000 2020-03-01 like 3 3
#> 7 2000 2020-01-02 purchase 1 1
#> 8 3000 2020-01-01 visit 1 1
#> 9 3000 2020-03-04 purchase 2 1
#> 10 3000 2020-05-22 visit 3 2
#> 11 4000 2020-03-05 like 1 1
#> 12 4000 2020-06-01 email 2 1
Created on 2021-05-11 by the reprex package (v0.3.0)
Upvotes: 1