sqlbeginner
sqlbeginner

Reputation: 71

In R, compare one column value to all other columns

I'm very new to R and I have a question which might be very simple for experts here.

Let's say i have a table "sales", which includes 4 customer IDs (123-126) and 4 products (A,B,C,D).

ID  A   B   C   D
123 0   1   1   0
124 1   1   0   0
125 1   1   0   1
126 0   0   0   1

I want to calculate the overlaps between products. So for A, the number of IDs that have both A and B will be 2. Similarly, the overlap between A and C will be 0 and that between A and D will be 1. Here is my code for A and B overlap:

overlap <- sales [which(sales [,"A"] == 1 & sales [,"B"] == 1 ),]
countAB <- count(overlap,"ID")

I want to repeat this calculation for all 4 products,so A overlaps with B,C,D and B overlaps with A,C,D, etc...How can i change the code to accomplish this?

I want the final output to be the number of IDs for each two-product combination. It's product affinity exercise and i want to find out for one product, which product sold the most with it. For example, for A, the most sold products with it will be B, followed by D, then C. Some sorting needs to be added to the code to get to this i think.

Thanks for your help!

Upvotes: 3

Views: 3064

Answers (4)

Marat Talipov
Marat Talipov

Reputation: 13304

You can use matrix multiplication:

m <- as.matrix(d[-1])
z <- melt(crossprod(m,m))
z[as.integer(z$X1) < as.integer(z$X2),]
#    X1 X2 value
# 5   A  B     2
# 9   A  C     0
# 10  B  C     1
# 13  A  D     1
# 14  B  D     1
# 15  C  D     0

where d is your data frame:

d <- structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L, 1L, 1L, 0L), C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)), .Names = c("ID", "A", "B", "C", "D"), class = "data.frame", row.names = c(NA, -4L))

[Update]

To calculate the product affinity, you can do:

z2 <- subset(z,X1!=X2)
do.call(rbind,lapply(split(z2,z2$X1),function(d) d[which.max(d$value),]))
#   X1 X2 value
# A  A  B     2
# B  B  A     2
# C  C  B     1
# D  D  A     1

Upvotes: 2

digEmAll
digEmAll

Reputation: 57210

Here's a possible solution :

sales <- 
read.csv(text=
"ID,A,B,C,D
123,0,1,1,0
124,1,1,0,0
125,1,1,0,1
126,0,0,0,1")

# get product names
prods <- colnames(sales)[-1]
# generate all products pairs (and transpose the matrix for convenience)
combs <- t(combn(prods,2))

# turn the combs into a data.frame with column P1,P2
res <- as.data.frame(combs)
colnames(res) <- c('P1','P2')  

# for each combination row :
# - subset sales selecting only the products in the row
# - count the number of rows summing to 2 (if sum=2 the 2 products have been sold together)
#   N.B.: length(which(logical_condition)) can be implemented with sum(logical_condition) 
#         since TRUE and FALSE are automatically coerced to 1 and 0
# finally add the resulting vector to the newly created data.frame
res$count <- apply(combs,1,function(comb){sum(rowSums(sales[,comb])==2)})

> res
  P1 P2 count
1  A  B     2
2  A  C     0
3  A  D     1
4  B  C     1
5  B  D     1
6  C  D     0

Upvotes: 2

Metrics
Metrics

Reputation: 15458

    #x1 is your dataframe
x1<-structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L, 
1L, 1L, 0L), C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)), .Names = c("ID", 
"A", "B", "C", "D"), class = "data.frame", row.names = c(NA, 
-4L))
#get the combination of all colnames but the first ("ID")
    k1<-combn(colnames(x1[,-1]),2)
#create two lists a1 and a2 so that we can iterate over each element 
    a1<-as.list(k1[seq(1,length(k1),2)])
    a2<-as.list(k1[seq(2,length(k1),2)])
# your own functions with varying i and j
     mapply(function(i,j) length(x1[which(x1[,i] == 1 & x1 [,j] == 1 ),1]),a1,a2)
    [1] 2 0 1 1 1 0

Upvotes: 2

untitled
untitled

Reputation: 152

You might want to take a look at the arules package. It does exactly what you are looking for. Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules). Also provides interfaces to C implementations of the association mining algorithms Apriori and Eclat by C. Borgelt.

Upvotes: 1

Related Questions