user1375640

Reputation: 151

Cluster binary matrix in R

I have a binary matrix between 2 variables. I would like to know if there is a way to cluster the binary matrix in R. If so, which algorithm should I be using?

The matrix looks like this

        hobby1  hobby2  hobby3  hobby4
person1   1       0       0       1
person2   0       1       0       1
person3   1       1       1       0
person4   0       1       1       1

I want to cluster the persons by the hobbies they have in common. What is the best method to do this?

Thanks

Upvotes: 3

Views: 3681

Answers (2)

akiwi

Reputation: 13

Are you wondering what is a useful similarity/dissimilarity metric for clustering binary data? There is the Jaccard index/coefficient, which is

(size of intersection) / (size of union)

a.k.a. (# of shared 1's) / (# of columns where at least one of the two rows has a 1). The corresponding Jaccard distance is 1 - the Jaccard index. There is also the simple matching coefficient, which counts shared 0's as agreement too:

(# of positions where the two vectors agree) / (length of vectors)
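To make the two coefficients concrete, here is a minimal R sketch applied to person1 and person2 from the question. The names m, jaccard_index and simple_matching are made up for this illustration, and the simple matching coefficient is taken in its usual form, where shared 0's count as agreement just like shared 1's.

```r
# Example matrix from the question: rows are persons, columns are hobbies.
m <- matrix(c(1, 0, 0, 1,
              0, 1, 0, 1,
              1, 1, 1, 0,
              0, 1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("person", 1:4), paste0("hobby", 1:4)))

# Jaccard index: shared 1's / columns where at least one row has a 1.
# (Illustrative helper; returns NaN if both rows are all zeros.)
jaccard_index <- function(a, b) sum(a & b) / sum(a | b)

# Simple matching coefficient: positions where the rows agree / vector length.
simple_matching <- function(a, b) mean(a == b)

j12 <- jaccard_index(m["person1", ], m["person2", ])
s12 <- simple_matching(m["person1", ], m["person2", ])
```

For person1 and person2 the Jaccard index is 1/3 (one shared hobby out of the three that either of them has) and the simple matching coefficient is 0.5 (they agree on hobby3 and hobby4).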

I'm sure there are other distance metrics proposed for binary data. This really is a statistics question so you should consult a book on that subject.

In R specifically, you can use dist(x, method="binary"), which computes the Jaccard distance (1 - the Jaccard index) between the rows of x. You then pass the resulting distance object to the clustering algorithm of your choice (e.g. hclust).
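As a rough sketch of that workflow on the matrix from the question: the average linkage and the choice of two clusters below are assumptions made only for illustration, not recommendations, and the variable names are arbitrary.

```r
# Example matrix from the question: rows are persons, columns are hobbies.
m <- matrix(c(1, 0, 0, 1,
              0, 1, 0, 1,
              1, 1, 1, 0,
              0, 1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("person", 1:4), paste0("hobby", 1:4)))

# Pairwise distances between rows; "binary" is the Jaccard distance.
d <- dist(m, method = "binary")

# Hierarchical clustering on that distance (linkage method is an assumption).
hc <- hclust(d, method = "average")

# Cut the tree into k groups (k = 2 is an arbitrary choice here).
groups <- cutree(hc, k = 2)
print(groups)

# plot(hc) draws the dendrogram, which helps when choosing k.
```

On this small example person2 and person4 are the closest pair (distance 1/3, since they share two of the three hobbies either has), so they are merged first.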

Upvotes: 0

Troy

Reputation: 8691

How about crossprod() and reshape2::melt():

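# CREATE THE MATRIX (20 persons x 10 hobbies, random 0/1 entries)
set.seed(1)  # arbitrary seed, only so the example is reproducible
m.h <- matrix(sample(0:1, 200, replace = TRUE), nrow = 20)

# CREATE CROSS-PRODUCT: entry [i, j] counts the hobbies persons i and j share
m.cross <- tcrossprod(m.h)  # same as m.h %*% t(m.h)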

# USE reshape2 to melt/flatten the data
library(reshape2)
m.long <- melt(m.cross)
# Show the pairs ordered by number of shared hobbies
m.long[order(m.long$value, m.long$Var2, m.long$Var1), ]

library(ggplot2)
person.labels <- paste0("Person ", seq_len(nrow(m.h)))
ggplot(m.long) +
  geom_tile(aes(Var1, Var2, fill = value)) +
  geom_text(aes(Var1, Var2, label = value)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_gradient(low = "yellow", high = "red") +
  # Var1/Var2 are integer indices, so label them on continuous scales
  scale_x_continuous(breaks = seq_len(nrow(m.h)), labels = person.labels) +
  scale_y_continuous(breaks = seq_len(nrow(m.h)), labels = person.labels) +
  coord_cartesian(xlim = c(0, nrow(m.h) + 1), ylim = c(0, nrow(m.h) + 1))

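The cross-product gives raw counts of shared hobbies, which a clustering function cannot use directly as a distance. One possible way to bridge the gap, sketched here on the matrix from the question, is to turn the counts into Jaccard distances; the variable names (shared, n.hobbies, jac) are hypothetical and this step is an addition for illustration, not part of the approach above.

```r
# Example matrix from the question: rows are persons, columns are hobbies.
m <- matrix(c(1, 0, 0, 1,
              0, 1, 0, 1,
              1, 1, 1, 0,
              0, 1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("person", 1:4), paste0("hobby", 1:4)))

# shared[i, j] = number of hobbies persons i and j have in common;
# the diagonal holds each person's own hobby count.
shared <- tcrossprod(m)
n.hobbies <- diag(shared)

# Jaccard index = shared / (hobbies of i + hobbies of j - shared).
jac <- shared / (outer(n.hobbies, n.hobbies, "+") - shared)

# 1 - Jaccard index as a dist object, ready for e.g. hclust().
d <- as.dist(1 - jac)
hc <- hclust(d)
```

This assumes every person has at least one hobby; a row of all zeros would produce a division by zero on the diagonal.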

Upvotes: 1
