fmarm
fmarm

Reputation: 4284

Create "contingency" table with multi-rows

Let's consider this dataset, where the first field is a bill number and the second one is the name of a product :

df=data.frame(bill=c(1,1,1,1,2,2,2,2,3,3),product=c("A","B","C","B","A","C","E","D","C","D"))

I would like to count the number of bills containing each combination of two products, for example in this case a result like this (I don't want to keep combinations where count is 0) :

#  prod1 prod2 count
#   A     B     1
#   A     C     2
#   A     D     1
#   A     E     1
#   B     C     1
#   C     D     2
#   C     E     1
#   D     E     1

I have a solution with loops but it's really not pretty (and slow !):

products=sort(unique(df$product))
bills_list=list()
for (i in 1:length(products)){
   bills_list[[i]]=unique(df[which(df$product==products[i]),"bill"])
}
df2=data.frame(prod1=character(0),prod2=character(0),count=numeric(0))
for (i in 1:(length(products)-1)){
   for (j in (i+1):length(products)){
       Nij=length(intersect(bills_list[[i]],bills_list[[j]]))
       if (Nij>0){
         temp=data.frame(prod1=products[i],prod2=products[j],count=Nij)
         df2=rbind(df2,temp)
       }
   }
}

Is there a way to do this without loops ? Thank you for your time.

Upvotes: 1

Views: 427

Answers (4)

user3229754
user3229754

Reputation: 111

res <- table(df$bill, df$product)
##> res
##   
##    A B C D E
##  1 1 2 1 0 0
##  2 1 0 1 1 1
##  3 0 0 1 1 0

res2 <- ifelse(res > 0, 1, 0)
##> res2
##   
##    A B C D E
##  1 1 1 1 0 0
##  2 1 0 1 1 1
##  3 0 0 1 1 0

cor(res2)
##
##     A    B  C    D    E
##A  1.0  0.5 NA -0.5  0.5
##B  0.5  1.0 NA -1.0 -0.5
##C   NA   NA  1   NA   NA
##D -0.5 -1.0 NA  1.0  0.5
##E  0.5 -0.5 NA  0.5  1.0
##Warning message:
##In cor(res2) : the standard deviation is zero

I do realize that this does not answer the question that you asked.

But, it may get you closer to the answer that, presumably, you seek. Namely, what is the impact of a customer ordering one product on the likelihood (positive or negative) that will order one of the others.

Upvotes: 0

Tyler Rinker
Tyler Rinker

Reputation: 109864

Here's another approach:

library(qdap)

dat <- unlist(lapply(split(df$product, df$bill), function(x) {
    y <- outer(unique(x), unique(x), paste)
    unlist(y[upper.tri(y)])
}))

dat2 <- data.frame(table(dat), stringsAsFactors = FALSE)

colsplit2df(dat2, sep=" ", new.names=paste0("prod", 1:2))

##   prod1 prod2 Freq
## 1     A     B    1
## 2     A     C    2
## 3     A     D    1
## 4     A     E    1
## 5     B     C    1
## 6     C     D    2
## 7     C     E    1
## 8     E     D    1

Upvotes: 1

shadow
shadow

Reputation: 22293

Here's a solution with plyr and data.table.

# needed packages
require(plyr)
require(data.table)
# find the combinations in each of the bills
combs <- ddply(df, .(bill), function(x){
  t(combn(unique(as.character(x$product)),2))
})
colnames(combs) <- c("bill", "prod1", "prod2")
# combine these 
res <- data.table(combs, key=c("prod1", "prod2"))[, .N, by=list(prod1, prod2)]

Upvotes: 3

Jake Burkhead
Jake Burkhead

Reputation: 6535

library(reshape2)

df$product <- as.character(df$product)
products <- t(combn(unique(df$product), 2))
dat <- dcast(bill ~ product, data = df)
##   bill A B C D E
## 1    1 1 2 1 0 0
## 2    2 1 0 1 1 1
## 3    3 0 0 1 1 0


out <- structure(
  data.frame(products, apply(products, 1, function(x) sum(rowSums(dat[x] > 0) == 2) )),
  names = c("prod1", "prod2", "count")
  )

out[out$count != 0,]
##    prod1 prod2 count
## 1      A     B     1
## 2      A     C     2
## 3      A     E     1
## 4      A     D     1
## 5      B     C     1
## 8      C     E     1
## 9      C     D     2
## 10     E     D     1

Upvotes: 1

Related Questions