Josh Cho
Josh Cho

Reputation: 159

R: Which products are bought together more frequently?

I have a customer data that lists what they buy. It only consider whether they buy it or not. It does not count how many they buy. (so 0 or 1 for entries)

   apple banana corn
1  1     0      1
2  0     1      0
3  1     1      1
4  0     0      0
5  1     0      1

I would like to know which of those products are likely to be sold together.

What i can think of is to consider all permutation, create variables accordingly and sums individuals up.

    a   a_b  a_c  b   b_c  a_b_c
1   0   0    1    0   0    0
2   0   0    0    1   0    0
3   0   0    0    0   0    1
4   0   0    0    0   0    0
5   0   0    1    0   0    0
sum 0   0    2    1   0    1

so in this little example people like to buy apple and corn together.

I tried doing in below way, but it became very cumbersome for the repeated code....

allCombs <- function(x) c(x, lapply(seq_along(x)[-1L], function(y) combn(x, y, paste0, collapse = "_")),recursive = TRUE)
name = c("a","b","c")
for (i in allCombs(name)) {
  df[,i] = 0
}
condition_1 = df[,"apple"] == 1
condition_2 = df[,"banana"] == 1
condition_3 = df[,"corn"] == 1
df[condition_1 & !condition_2 & !condition_3, "a"] = 1
df[condition_1 & condition_2 & !condition_3, "a_b"] = 1
...

I would like to know how this could be done in more efficient/better way, handling all permutation.

Moreover, if there are better ways to solve this problem, any insight will be very helpful. Since as product category increases, the number of permutation grows very quickly, and the result matrix is very sparse which i think not a good idea.

Here is a sample data

df <- structure(list(a = c(1,0,1,0,1), b = c(0,1,1,0,0), c = c(1,0,1,0,1), class = "data.frame")

Thanks in advance!

Upvotes: 1

Views: 384

Answers (1)

Roland
Roland

Reputation: 132651

This can be solved with a simple call to table:

DF <- read.table(text = "   apple banana corn
1  1     0      1
2  0     1      0
3  1     1      1
4  0     0      0
5  1     0      1", header = TRUE)

as.data.frame(do.call(table, DF))
#  apple banana corn Freq
#1     0      0    0    1
#2     1      0    0    0
#3     0      1    0    1
#4     1      1    0    0
#5     0      0    1    0
#6     1      0    1    2
#7     0      1    1    0
#8     1      1    1    1

Indeed, apple + corn - banana is most common. However, this does not tell you "which of those products are likely to be sold together" because for this you'd need to collapse to pairwise frequencies. That could be done like this:

pairs1 <- combn(DF, 2, function(x) x[1] == 1 & x[1] == x[2], simplify = FALSE)
pairs2 <- combn(names(DF), 2)
paircounts <- data.frame(t(pairs2), freq = sapply(pairs1, sum))
#      X1     X2 freq
#1  apple banana    1
#2  apple   corn    3
#3 banana   corn    1

It wouldn't be efficient for many products because the number of combinations gets huge quickly. I think you should be able to do something clever with package igraph instead but I'm not good enough with graphs to solve this quickly.

Upvotes: 4

Related Questions