Reputation: 59
I have a table of categorical values I would like to cluster both by the rows, and by the columns.
Example data: test_dataset.csv
I,II,III,IV,V
A,0,3,3,2,3
B,0,3,3,0,0
C,0,0,3,3,3
D,0,3,1,3,0
E,0,0,3,0,0
The levels are "no data", "no increase", "mixed", and "increase".
I found an R package, blockcluster, that in theory should be able to do this.
#install.packages("blockcluster")
library(blockcluster)
# Encoding: 0 = no data, 1 = no increase, 2 = mixed, 3 = increase
dataset <- read.table("test_dataset.csv", header = TRUE, sep = ",")
# nbcocluster = c(number of row clusters, number of column clusters)
out <- coclusterCategorical(as.matrix(dataset), nbcocluster = c(3, 2))
summary(out)
plot(out)
This is the resulting plot:
I would like some help interpreting this plot, in case someone has worked with this package before: how do I know which row/column of the original table ends up where in the co-clustered data?
If I am not mistaken, the nbcocluster parameter sets the number of row and column clusters. How do I know beforehand what the appropriate number of clusters is?
Is it appropriate to do categorical clustering if one of the categories is essentially missing data?
I am open to suggestions for other methods that can bicluster categorical data. I appreciate any and all help, as I have never done this before.
Upvotes: 0
Views: 278
Reputation: 59
For the first question, I figured out the answer (thanks to the forums at InriaForge).
The cluster assignments don't show up on the plot by default, but you can bind the classification results to your original data, e.g.:
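# Append each row's cluster label as an extra column
result_c <- cbind(dataset, rowclass = out@rowclass)
# Append each column's cluster label as an extra row (NA pads the rowclass column)
result <- rbind(result_c, c(out@colclass, NA))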
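To make the mapping easier to read, here is a minimal base-R sketch of the same idea that runs without blockcluster. The vectors `row_class` and `col_class` are made-up stand-ins for `out@rowclass` and `out@colclass` (they are not real output from the package), and `row_map`, `col_map` and `reordered` are names I chose for illustration. Sorting by the labels should give roughly the block layout that the co-clustered plot shows, assuming the plot groups rows and columns by cluster.

```r
# Inline copy of test_dataset.csv. The header has one field fewer than the
# data rows, so read.table() uses the first column (A..E) as row names.
csv_text <- "I,II,III,IV,V
A,0,3,3,2,3
B,0,3,3,0,0
C,0,0,3,3,3
D,0,3,1,3,0
E,0,0,3,0,0"
dataset <- read.table(text = csv_text, header = TRUE, sep = ",")

# HYPOTHETICAL labels standing in for out@rowclass and out@colclass.
# Replace them with the slots from your own fit.
row_class <- c(0, 1, 0, 2, 1)
col_class <- c(1, 0, 0, 0, 1)

# Lookup tables: which original row/column fell into which cluster.
row_map <- data.frame(row = rownames(dataset), cluster = row_class)
col_map <- data.frame(column = colnames(dataset), cluster = col_class)

# Reorder the original table so members of a cluster sit next to each other.
reordered <- dataset[order(row_class), order(col_class)]

print(row_map)
print(col_map)
print(reordered)
```

The row and column names survive the reordering, so each cell in `reordered` can be traced back to the original table.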
I did not find a way to choose the appropriate number of clusters, or an answer on whether it's appropriate to cluster when one of the categories is missing data.
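On the number of clusters, one thing that could be tried (a sketch, not a confirmed recipe) is to fit several candidate values of nbcocluster and compare a model-selection score such as the ICL, keeping the candidate with the largest value. The helper `select_nbcocluster` and its `fit_fun`/`score_fun` hooks are hypothetical names of mine. I am also assuming that the fitted object exposes the ICL in an `ICLvalue` slot and that larger is better, so please check this against the package documentation. On a table as small as 5x5 the fits may well fail or be unstable.

```r
# Score every candidate and keep the best one. Fits that fail, or scores
# that are not a single number, are recorded as NA and skipped.
select_nbcocluster <- function(data, candidates, fit_fun, score_fun) {
  scores <- vapply(candidates, function(k) {
    tryCatch({
      s <- as.numeric(score_fun(fit_fun(data, k)))
      if (length(s) == 1) s else NA_real_
    }, error = function(e) NA_real_)
  }, numeric(1))
  if (all(is.na(scores))) stop("no candidate could be fitted")
  list(best = candidates[[which.max(scores)]], scores = scores)
}

# Candidate (row clusters, column clusters) pairs; this grid is arbitrary.
candidates <- list(c(2, 2), c(3, 2), c(2, 3), c(3, 3))

# Only runs when blockcluster and the example file are available.
if (requireNamespace("blockcluster", quietly = TRUE) &&
    file.exists("test_dataset.csv")) {
  dataset <- read.table("test_dataset.csv", header = TRUE, sep = ",")
  res <- select_nbcocluster(
    as.matrix(dataset), candidates,
    fit_fun = function(d, k) {
      blockcluster::coclusterCategorical(d, nbcocluster = k)
    },
    # ASSUMPTION: the ICL is stored in the `ICLvalue` slot of the fit.
    score_fun = function(fit) fit@ICLvalue
  )
  print(res)
}
```

Because the estimation starts from random initial values, it is probably worth repeating the comparison a few times before trusting the winner.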
Upvotes: 0