ira
ira

Reputation: 2644

How to find number of times items and their combinations were purchased?

I have a data.table showing which items a customer purchased. Each row represents one customer and each column one item. The table has same number of columns for each customer and the values in columns item* are either 1 or 0 based on whether customer purchased a given item. Simple version of the table looks like this:

data.table(customerID = c(1,2,3,4,5),
           item1 = c(1,0,0,1,1),
           item2 = c(1,0,1,1,1),
           item3 = c(1,0,0,0,1),
           item4 = c(0,1,1,1,1))

The table says that customer 1 purchased items 1,2,3 and item 3 was purchased by customers 1 and 5.

In real case the data.table has so many columns it would be impractical to refer to them by name in the code, but it would be ok to have the data in long format instead.

I need to find out how many times individual items were purhcased and how many times their combinations were purchased. In this case i would like to get something like:

item1 3
item2 4
item3 2
item4 4
item1;item2 3
item1;item3 2
item1;item4 1
...
(same for other combinations of length 2)
...
item1;item2;item3 2
item1;item2;item4 1

...
up to combinations of 4 items.

Moreover, i'll need for each customer a table indicating which combinations of products he or she purchased.

Edit:

Thanks to three very useful answers, i know how to answer the first part of the question - i.e. calculate how many customers purchased certain combination. However, the second part remains unanswered. I would like to know which customers purchased which combination.

Upvotes: 3

Views: 754

Answers (3)

Ronak Shah
Ronak Shah

Reputation: 388817

This is a completely base R option so converting the data to dataframe

df <- data.frame(df)
unique_product <- names(df[-1])

stack(unlist(sapply(seq_along(unique_product), function(x) 
     combn(unique_product, x, FUN = function(y) 
           setNames(sum(rowSums(df[y] == 1) == length(y)), 
            paste0(y, collapse = ";")), simplify = FALSE))))


#   values                     ind
#1       3                   item1
#2       4                   item2
#3       2                   item3
#4       4                   item4
#5       3             item1;item2
#6       2             item1;item3
#7       2             item1;item4
#8       2             item2;item3
#9       3             item2;item4
#10      1             item3;item4
#11      2       item1;item2;item3
#12      2       item1;item2;item4
#13      1       item1;item3;item4
#14      1       item2;item3;item4
#15      1 item1;item2;item3;item4

We create all combination of each unique product with combn and for every combination count how many of them occur together by subsetting the respective columns from dataframe.


To get the customers which unlocked certain combinations we can continue the same approach

stack(unlist(sapply(seq_along(unique_product), function(x) 
     combn(unique_product, x, FUN = function(y) {
      inds <- rowSums(df[x] == 1) == length(x)
      setNames(df$customerID[inds], 
             rep(paste0(y, collapse = ";"), sum(inds)))
             }, simplify = FALSE))))

#   values                     ind
#1       1                   item1
#2       1                   item2
#3       1                   item3
#4       1                   item4
#5       1             item1;item2
#6       4             item1;item2
#7       5             item1;item2
#8       1             item1;item3
#9       4             item1;item3
#10      5             item1;item3
#....

You can rename the columns if needed but here values are the Customer Id's and ind are the combinations which the respective customer unlocked.

Upvotes: 1

s_baldur
s_baldur

Reputation: 33498

Here is some dirty code that allows you to set the parameter n_items which controls the maximum size of a bundle:

library(magrittr)
DT_melt <- DT[, melt(.SD, id.vars = "customerID", variable.factor = FALSE)
              ][value == 1
                ][, variable := as.integer(sub("item", "", variable))]
n_items <- 4L
keep_track <- list()
for (i in seq_len(n_items)) {
  combs <- combn(seq_len(n_items), i)
  keep_track[[i]] <- apply(combs, 2, function(x)  DT_melt[, all(x %in% variable), by = customerID]) %>%
    lapply(function(x) sum(x[[2]])) %>% 
    setNames(apply(combs, 2, function(x) paste(paste0("item", x), collapse = ";")))
}
unlist(keep_track)

Returns a named vector of counts:

#                   item1                   item2 
#                       3                       4 
#                   item3                   item4 
#                       2                       4 
#             item1;item2             item1;item3 
#                       3                       2 
#             item1;item4             item2;item3 
#                       2                       2 
#             item2;item4             item3;item4 
#                       3                       1 
#       item1;item2;item3       item1;item2;item4 
#                       2                       2 
#       item1;item3;item4       item2;item3;item4 
#                       1                       1 
# item1;item2;item3;item4 
#                       1 

Upvotes: 1

Wimpel
Wimpel

Reputation: 27732

A step by step approach using baseR and data.table

sample data

DT <- data.table(customerID = c(1,2,3,4,5),
           item1 = c(1,0,0,1,1),
           item2 = c(1,0,1,1,1),
           item3 = c(1,0,0,0,1),
           item4 = c(0,1,1,1,1))

code

#identify columns with items, grab their names
cols <- names(DT[,-1])

in code below: set 1:length(cols) to 1:n if you want combinations of maximum of n products

#put all combinations of items in a list
combos <- unlist( lapply( 1:length(cols), combn, x = cols, simplify = FALSE ), recursive = FALSE )

#calculate number of sold items per combo
l <- lapply( combos, function(x) {
  nrow( DT[ rowSums( DT[, x, with = FALSE ] ) == length( x ), ] )
})

#name the list based on the combo
names(l) <- lapply( combos, paste0, collapse = ";")

output

str( l )
List of 15
$ item1                  : int 3
$ item2                  : int 4
$ item3                  : int 2
$ item4                  : int 4
$ item1;item2            : int 3
$ item1;item3            : int 2
$ item1;item4            : int 2
$ item2;item3            : int 2
$ item2;item4            : int 3
$ item3;item4            : int 1
$ item1;item2;item3      : int 2
$ item1;item2;item4      : int 2
$ item1;item3;item4      : int 1
$ item2;item3;item4      : int 1
$ item1;item2;item3;item4: int 1

or create a data.table

as.data.table( as.matrix( unlist(l), ncol = 2, nrow = length(l) ), keep.rownames = TRUE )
#                         rn V1
# 1:                   item1  3
# 2:                   item2  4
# 3:                   item3  2
# 4:                   item4  4
# 5:             item1;item2  3
# 6:             item1;item3  2
# 7:             item1;item4  2
# 8:             item2;item3  2
# 9:             item2;item4  3
#10:             item3;item4  1
#11:       item1;item2;item3  2
#12:       item1;item2;item4  2
#13:       item1;item3;item4  1
#14:       item2;item3;item4  1
#15: item1;item2;item3;item4  1

Upvotes: 3

Related Questions