Reputation: 151
I have a binary matrix between 2 variables. I would like to know if there is a way to cluster the binary matrix in R. If so, which algorithm should I be using?
The matrix looks like this
        hobby1 hobby2 hobby3 hobby4
person1      1      0      0      1
person2      0      1      0      1
person3      1      1      1      0
person4      0      1      1      1
I want to cluster the persons by the hobbies they have in common. What is the best method to do this?
Thanks
Upvotes: 3
Views: 3681
Reputation: 13
Are you wondering what is a useful similarity/dissimilarity metric for clustering binary data? There is the Jaccard index/coefficient, which is
(size of intersection) / (size of union)
a.k.a. (# of shared 1's) / (# of columns where at least one of the two rows has a 1). The corresponding Jaccard distance is 1 - the Jaccard index. There is also the simple matching coefficient, which counts shared 0's as agreement too:
(# of columns where the two rows agree, on either a 1 or a 0) / (length of vectors)
I'm sure there are other distance metrics proposed for binary data. This really is a statistics question so you should consult a book on that subject.
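As a quick illustration of the two coefficients above, here is a minimal sketch in base R using the first two rows of the question's matrix. The helper names jaccard_index and simple_matching are made up for this example and are not from any package.

```r
# Two rows from the question's matrix
person1 <- c(1, 0, 0, 1)
person2 <- c(0, 1, 0, 1)

# Hypothetical helper: shared 1's divided by columns where at least one row has a 1
jaccard_index <- function(a, b) sum(a & b) / sum(a | b)

# Hypothetical helper: proportion of columns where the two rows agree (1-1 or 0-0)
simple_matching <- function(a, b) mean(a == b)

jaccard_index(person1, person2)    # 1 shared hobby out of 3 in the union: 1/3
simple_matching(person1, person2)  # rows agree on hobby3 and hobby4: 2/4 = 0.5
```

The two measures differ because simple matching treats a hobby that neither person has as evidence of similarity, while Jaccard ignores it.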
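In R specifically, you can use dist(x, method="binary"), which computes the Jaccard distance: the proportion of columns in which only one of the two rows has a 1, among the columns in which at least one of them has a 1. You can then pass the resulting distance object to the clustering algorithm of your choice (e.g. hclust).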
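A minimal sketch of that workflow on the question's matrix might look like the following. The "average" linkage and the cut into k = 2 groups are assumptions chosen only for illustration; other linkages or cluster counts may suit real data better.

```r
# The 4 x 4 matrix from the question (rows = persons, columns = hobbies)
x <- matrix(c(1, 0, 0, 1,
              0, 1, 0, 1,
              1, 1, 1, 0,
              0, 1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("person", 1:4), paste0("hobby", 1:4)))

# Jaccard distance between every pair of rows
d <- dist(x, method = "binary")

# Hierarchical clustering (average linkage is an assumption, not a recommendation)
hc <- hclust(d, method = "average")

# Cut the tree into 2 groups (k = 2 is likewise an arbitrary choice)
groups <- cutree(hc, k = 2)
groups
```

With this toy data, person2 and person4 merge first because they share two of the three hobbies either of them has. Calling plot(hc) draws the dendrogram if you want to pick the number of groups by eye.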
Upvotes: 0
Reputation: 8691
How about crossprod() and reshape2::melt():
# Create a random 20 x 10 binary matrix (rows = persons, columns = hobbies)
m.h <- matrix(sample(0:1, 200, replace = TRUE), nrow = 20)
# Cross-product of the rows: entry [i, j] is the number of hobbies persons i and j share
m.cross <- tcrossprod(m.h)
# Use reshape2 to melt/flatten the matrix into long format (Var1, Var2, value)
library(reshape2)
m.long <- melt(m.cross)
# List the pairs ordered by the number of shared hobbies
m.long[order(m.long$value, m.long$Var2, m.long$Var1), ]
# Plot the shared-hobby counts as a heat map
library(ggplot2)
person.labels <- paste0("Person ", seq_len(nrow(m.h)))
ggplot(m.long) +
  geom_tile(aes(Var1, Var2, fill = value)) +
  geom_text(aes(Var1, Var2, label = value)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_gradient(low = "yellow", high = "red") +
  # Var1 and Var2 are integer indices, so use continuous scales with one break per person
  scale_x_continuous(breaks = seq_len(nrow(m.h)), labels = person.labels) +
  scale_y_continuous(breaks = seq_len(nrow(m.h)), labels = person.labels) +
  coord_cartesian(xlim = c(0, nrow(m.h) + 1), ylim = c(0, nrow(m.h) + 1))
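If you want to cluster on these counts rather than just look at them, one option is to normalise the shared-hobby counts into Jaccard similarities. The sketch below is an illustration on the question's 4 x 4 matrix, not part of the approach above; the variable names are arbitrary, and it assumes every person has at least one hobby (otherwise a union size of 0 would produce NaN).

```r
# The matrix from the question (rows = persons, columns = hobbies)
x <- matrix(c(1, 0, 0, 1,
              0, 1, 0, 1,
              1, 1, 1, 0,
              0, 1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("person", 1:4), paste0("hobby", 1:4)))

# Shared-hobby counts: entry [i, j] is the number of hobbies persons i and j share
shared <- tcrossprod(x)

# The diagonal holds each person's own number of hobbies
n.hobbies <- diag(shared)

# Size of the union = |A| + |B| - |A and B|
union.size <- outer(n.hobbies, n.hobbies, "+") - shared

# Jaccard similarity, then distance in a form hclust() accepts
jaccard.sim <- shared / union.size
jaccard.dist <- as.dist(1 - jaccard.sim)
jaccard.dist
```

The result matches what dist(x, method = "binary") returns for this matrix, so it can be passed to hclust() in the same way.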
Upvotes: 1