Reputation: 631
I have customer data grouped by customer ID and ordered by date of purchase. I want to add a column that cumulatively counts the number of distinct products that have been ordered to date - i.e:
Input <- data.frame(Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03", "C-03", "C-03"),
Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH", "FISH", "FISH"),
Date = c("2018-01-02","2018-01-05","2018-01-03","2018-01-06","2018-01-08","2018-01-12","2018-01-02","2018-01-04", "2018-01-16", "2018-01-20"))
Output <- data.frame(Customer = c("C-01", "C-01", "C-02", "C-02", "C-02", "C-02", "C-03", "C-03", "C-03", "C-03"),
Product = c("COKE", "COKE", "FRIES", "SHAKE", "BURGER", "BURGER", "CHICKEN", "FISH", "FISH", "FISH"),
Date = c("2018-01-02","2018-01-05","2018-01-03","2018-01-06","2018-01-08","2018-01-12","2018-01-02","2018-01-04", "2018-01-16", "2018-01-20"),
Cum_Distinct = c(1, 1, 1, 2, 3, 3, 1, 2, 2, 2))
With the Input data set, I want to create the output data using dplyr. How do I keep a cumulative count of distinct products encountered to date?
Upvotes: 0
Views: 80
Reputation: 887501
We can use match
to get the index of unique
elements of 'Product'
library(dplyr)
Input %>%
group_by(Customer) %>%
mutate(Cum_Distinct = match(Product, unique(Product)))
# A tibble: 10 x 4
# Groups: Customer [3]
# Customer Product Date Cum_Distinct
# <fct> <fct> <fct> <int>
# 1 C-01 COKE 2018-01-02 1
# 2 C-01 COKE 2018-01-05 1
# 3 C-02 FRIES 2018-01-03 1
# 4 C-02 SHAKE 2018-01-06 2
# 5 C-02 BURGER 2018-01-08 3
# 6 C-02 BURGER 2018-01-12 3
# 7 C-03 CHICKEN 2018-01-02 1
# 8 C-03 FISH 2018-01-04 2
# 9 C-03 FISH 2018-01-16 2
#10 C-03 FISH 2018-01-20 2
Or using group_indices
library(tidyverse)
Input %>%
group_by(Customer) %>%
nest %>%
mutate(data = map(data, ~ .x %>%
mutate(Cum_Distinct = group_indices(., Product)))) %>%
Or using base R
Input$Cum_Distinct <- with(Input, as.integer(ave(as.character(Product), Customer,
FUN = function(x) match(x, unique(x)))))
Upvotes: 0
Reputation: 389135
We could take cumulative sum over non-duplicated
values by group.
library(dplyr)
Input %>%
group_by(Customer) %>%
mutate(Cum_Distinct = cumsum(!duplicated(Product)))
# Customer Product Date Cum_Distinct
# <fct> <fct> <fct> <int>
# 1 C-01 COKE 2018-01-02 1
# 2 C-01 COKE 2018-01-05 1
# 3 C-02 FRIES 2018-01-03 1
# 4 C-02 SHAKE 2018-01-06 2
# 5 C-02 BURGER 2018-01-08 3
# 6 C-02 BURGER 2018-01-12 3
# 7 C-03 CHICKEN 2018-01-02 1
# 8 C-03 FISH 2018-01-04 2
# 9 C-03 FISH 2018-01-16 2
#10 C-03 FISH 2018-01-20 2
Upvotes: 1