emil_rore
emil_rore

Reputation: 115

How to summarize number of products purchased by a customer based on a product list in R

I have two dataframes in R, one consisting of a list of product SKUs (product ids), the other of a purchase log containing the order number, the customer email, the purchase date as well as the product id (product_sku) and quantity purchased.

purchases_dataframe:

order_number | email                | product_sku | quantity | purchase_date
1000         |[email protected]  | RT-100      | 2        | 2020-01-01
1000         |[email protected]  | CT-300      | 1        | 2020-01-01
1000         |[email protected]  | Phone-100   | 1        | 2020-01-01
2000         |[email protected]  | Phone-200   | 1        | 2020-04-20
2000         |[email protected]  | OM-200      | 1        | 2020-04-20
3000         |[email protected]  | CT-300      | 3        | 2020-03-15
4000         |[email protected]  | OM-200      | 5        | 2020-07-07
5000         |[email protected]  | Phone-200   | 3        | 2020-08-19
6000         |[email protected]  | Phone-100   | 1        | 2020-09-22
6000         |[email protected]  | RT-100      | 1        | 2020-09-22

tv_list:

 SKU
    RT-100
    CT-300
    OM-200
    LL-400
    ...

I would like to count the total number of TVs a customer has purchased in his/her lifetime and disregard all other products (e.g. phones). The dataframe tv_list should help me identify which SKUs are TVs and which are not, as I have a variety of different TV SKUs and the above is just a smaller example. The resulting dataframe would ideally look like this:

email                | number_purchased_tv
[email protected] | 8
[email protected] | 1
[email protected] | 4
[email protected] | 0

For reproducability and in order to easier follow my examples, here is the code for the two sample_tables above:

purchase_dataframe <- data.frame(order_number = c(1000,1000,1000, 2000,2000, 3000, 4000, 5000, 6000, 6000),
                      email = c("[email protected]","[email protected]", "[email protected]","[email protected]",
                                "[email protected]","[email protected]","[email protected]","[email protected]",
                                "[email protected]","[email protected]"),
                      product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200", "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
                      quantity = c(2,1,1,1,1,3,5,3,1,1),
                      purchase_date = c("2020-01-01","2020-01-01","2020-01-01","2020-04-20","2020-04-20","2020-03-15","2020-07-07","2020-08-19","2020-09-22","2020-09-22"))

tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))

Thanks a lot!

Upvotes: 1

Views: 124

Answers (4)

gaut
gaut

Reputation: 5958

The below does what you requested using dplyr

library(dplyr)
library(data.table)
purchase_dataframe %>% dplyr::group_by(email) %>% dplyr::summarise(sumtv = sum(quantity[product_sku %in% unique(tv_list$SKU)]))
# A tibble: 4 x 2
email                sumtv
<chr>                <dbl>
  1 [email protected]     8
2 [email protected]     1
3 [email protected]     4
4 [email protected]     0

EDIT please find a correction above regarding the sumtv figure and a data.table solution below

library(dplyr)
library(data.table)
purchase_datatable <- purchase_dataframe
purchase_datatable %>% setDT
> purchase_datatable[,sumtv := sum(quantity[product_sku %in% unique(tv_list$SKU)]), by="email"][
  +   ,.(email, sumtv)] %>% unique
email sumtv
1: [email protected]     8
2: [email protected]     1
3: [email protected]     4
4: [email protected]     0

microbenchmarking gives almost a 50% advantage to the data.table solution which IMO is an excellent package well worth learning through these vignettes

library(microbenchmark)
microbenchmark(purchase_datatable[,sumtv := sum(quantity[product_sku %in% unique(tv_list$SKU)]), by="email"][
  ,.(email, sumtv)] %>% unique, purchase_dataframe %>% dplyr::group_by(email) %>% dplyr::summarise(sumtv = sum(quantity[product_sku %in% unique(tv_list$SKU)]))
)
min      lq     mean  median     uq    max neval
1.268 1.42700 1.823445 1.80300 2.0887 2.8332   100
2.715 2.98025 3.250287 3.20355 3.3509 8.8255   100

Upvotes: 3

luis vergara
luis vergara

Reputation: 141

This is the data you provided:

library('dplyr')
purchase_dataframe <- data.frame(order_number = c(1000,1000,1000, 2000,2000, 3000, 4000, 5000, 6000, 6000),
                      email = c("[email protected]","[email protected]", "[email protected]","[email protected]",
                                "[email protected]","[email protected]","[email protected]","[email protected]",
                                "[email protected]","[email protected]"),
                      product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200", "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
                      quantity = c(2,1,1,1,1,3,5,3,1,1),
                      purchase_date = c("2020-01-01","2020-01-01","2020-01-01","2020-04-20","2020-04-20","2020-03-15","2020-07-07","2020-08-19","2020-09-22","2020-09-22"))

tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))

This will give you the summary but omit any email(customer that haven't purchase a tv yet)

total_tvs_by_cusomter <- purchase_dataframe %>%
  filter(product_sku %in% tv_list$SKU) %>%
  group_by(email) %>%
  mutate(quantity = as.numeric(quantity)) %>%
  summarise(number_purchased_tv = sum(quantity))

Outcome:

# A tibble: 3 x 2
  email                number_purchased_tv
  <chr>                              <dbl>
1 [email protected]                   8
2 [email protected]                   1
3 [email protected]                   4

and this is in case if you want to keep the emails/customers that haven't yet bought a TV and add them as 0

total_tvs_by_cusomter <- left_join(unique(purchase_dataframe %>%
            select(email)), total_tvs_by_cusomter)

total_tvs_by_cusomter[is.na(total_tvs_by_cusomter)] <- 0

Outcome:

                 email number_purchased_tv
1 [email protected]                   8
2 [email protected]                   1
3 [email protected]                   4
4 [email protected]                   0

Upvotes: 1

Chris
Chris

Reputation: 472

tv_purchases <-
purchase_dataframe %>% 
  group_by(email) %>% 
  filter(product_sku %in% tv_list$SKU) %>%
  summarise(number_purchased_tv = sum(as.numeric(quantity)))

## join tv_purchases on distinct emails, to also have the '[email protected]     0' row

purchase_dataframe %>%
  distinct(email) %>%
  left_join(tv_purchases) %>% ## emails which are not in tv_purchases will have NAs
  mutate(number_purchased_tv = case_when(is.na(number_purchased_tv) ~ 0, ## NAs become zeros
                                         TRUE ~ number_purchased_tv) ## non-NAs stay as they are
         )

Upvotes: 0

Duck
Duck

Reputation: 39623

One option using base R:

#Match and index
purchase_dataframe$ProductIndex <- tv_list[match(purchase_dataframe$product_sku,tv_list$SKU),'SKU']
purchase_dataframe$Counter <- ifelse(is.na(purchase_dataframe$ProductIndex),0,purchase_dataframe$quantity)
#Aggregate
Res <- aggregate(Counter~email,data=purchase_dataframe,sum,na.rm=T)

Output:

                 email Counter
1 [email protected]       8
2 [email protected]       1
3 [email protected]       4
4 [email protected]       0

Upvotes: 1

Related Questions