Reputation: 71
I'm very new to R and I have a question which might be very simple for experts here.
Let's say i have a table "sales", which includes 4 customer IDs (123-126) and 4 products (A,B,C,D).
ID A B C D
123 0 1 1 0
124 1 1 0 0
125 1 1 0 1
126 0 0 0 1
I want to calculate the overlaps between products. So for A, the number of IDs that have both A and B will be 2. Similarly, the overlap between A and C will be 0 and that between A and D will be 1. Here is my code for A and B overlap:
overlap <- sales [which(sales [,"A"] == 1 & sales [,"B"] == 1 ),]
countAB <- count(overlap,"ID")
I want to repeat this calculation for all 4 products,so A overlaps with B,C,D and B overlaps with A,C,D, etc...How can i change the code to accomplish this?
I want the final output to be the number of IDs for each two-product combination. It's product affinity exercise and i want to find out for one product, which product sold the most with it. For example, for A, the most sold products with it will be B, followed by D, then C. Some sorting needs to be added to the code to get to this i think.
Thanks for your help!
Upvotes: 3
Views: 3064
Reputation: 13304
You can use matrix multiplication:
m <- as.matrix(d[-1])
z <- melt(crossprod(m,m))
z[as.integer(z$X1) < as.integer(z$X2),]
# X1 X2 value
# 5 A B 2
# 9 A C 0
# 10 B C 1
# 13 A D 1
# 14 B D 1
# 15 C D 0
where d
is your data frame:
d <- structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L, 1L, 1L, 0L), C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)), .Names = c("ID", "A", "B", "C", "D"), class = "data.frame", row.names = c(NA, -4L))
[Update]
To calculate the product affinity, you can do:
z2 <- subset(z,X1!=X2)
do.call(rbind,lapply(split(z2,z2$X1),function(d) d[which.max(d$value),]))
# X1 X2 value
# A A B 2
# B B A 2
# C C B 1
# D D A 1
Upvotes: 2
Reputation: 57210
Here's a possible solution :
sales <-
read.csv(text=
"ID,A,B,C,D
123,0,1,1,0
124,1,1,0,0
125,1,1,0,1
126,0,0,0,1")
# get product names
prods <- colnames(sales)[-1]
# generate all products pairs (and transpose the matrix for convenience)
combs <- t(combn(prods,2))
# turn the combs into a data.frame with column P1,P2
res <- as.data.frame(combs)
colnames(res) <- c('P1','P2')
# for each combination row :
# - subset sales selecting only the products in the row
# - count the number of rows summing to 2 (if sum=2 the 2 products have been sold together)
# N.B.: length(which(logical_condition)) can be implemented with sum(logical_condition)
# since TRUE and FALSE are automatically coerced to 1 and 0
# finally add the resulting vector to the newly created data.frame
res$count <- apply(combs,1,function(comb){sum(rowSums(sales[,comb])==2)})
> res
P1 P2 count
1 A B 2
2 A C 0
3 A D 1
4 B C 1
5 B D 1
6 C D 0
Upvotes: 2
Reputation: 15458
#x1 is your dataframe
x1<-structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L,
1L, 1L, 0L), C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)), .Names = c("ID",
"A", "B", "C", "D"), class = "data.frame", row.names = c(NA,
-4L))
#get the combination of all colnames but the first ("ID")
k1<-combn(colnames(x1[,-1]),2)
#create two lists a1 and a2 so that we can iterate over each element
a1<-as.list(k1[seq(1,length(k1),2)])
a2<-as.list(k1[seq(2,length(k1),2)])
# your own functions with varying i and j
mapply(function(i,j) length(x1[which(x1[,i] == 1 & x1 [,j] == 1 ),1]),a1,a2)
[1] 2 0 1 1 1 0
Upvotes: 2
Reputation: 152
You might want to take a look at the arules package. It does exactly what you are looking for. Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules). Also provides interfaces to C implementations of the association mining algorithms Apriori and Eclat by C. Borgelt.
Upvotes: 1