Mendi

Reputation: 113

Measure Accuracy in Hierarchical Clustering (Single link) in R

How can I measure accuracy in hierarchical clustering (single link) in R with 2 clusters? Here is my code:

> dcdata = read.csv("kkk.txt")
> target = dcdata[,3]
> dcdata = dcdata [,1:2]
> d = dist(dcdata)
> hc_single = hclust(d,method="single")
> plot(hc_single)
> clusters =cutree(hc_single, k=2)
> print(clusters)

Thanks!

Upvotes: 0

Views: 2972

Answers (1)

StupidWolf

Reputation: 46968

Accuracy is not the most accurate term, but I guess you want to see whether the hierarchical clustering gives you clusters or groups that coincide with your labels. As an example, I use the iris dataset, with setosa vs. others as the target:

data = iris
target = ifelse(data$Species=="setosa","setosa","others")
table(target)
others setosa 
   100     50

data = data[,1:4]
d = dist(data)
hc_single = hclust(d,method="single")
plot(hc_single)

It seems like there are two major clusters. Now let's see how the target labels are distributed across the dendrogram:

library(dendextend)
dend <- as.dendrogram(hc_single)
COLS = c("turquoise","orange")
names(COLS) = unique(target)
dend <- color_labels(dend, col = COLS[target[labels(dend)]])
plot(dend) 

Now, as you did, we cut the tree into two clusters and cross-tabulate them against the target:

clusters =cutree(hc_single, k=2)
table(clusters,target)

            target
    clusters others setosa
           1      0     50
           2    100      0

You get a perfect separation: every data point in cluster 1 is setosa and every data point in cluster 2 is not. You can think of this as 100% accuracy, but I would be careful about using the term, because the clustering never sees the labels and the cluster numbers themselves are arbitrary.
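If the raw counts are hard to read, the same cross-tabulation can be turned into per-cluster proportions. This is only a small sketch in base R that rebuilds the iris example from scratch; the name row_props is mine, not something from the question.

```r
# Rebuild the example: setosa vs. others, single-link clustering, 2 clusters
data(iris)
target <- ifelse(iris$Species == "setosa", "setosa", "others")
clusters <- cutree(hclust(dist(iris[, 1:4]), method = "single"), k = 2)

# Counts of each label within each cluster
tab <- table(clusters, target)

# Share of each label within a cluster (every row sums to 1)
row_props <- prop.table(tab, margin = 1)
print(round(row_props, 3))
```

A row that is close to all-or-nothing means that cluster is dominated by one label; a row near 0.5/0.5 means the cluster mixes both.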

You can roughly calculate the agreement between clusters and labels like this:

Majority_class = tapply(factor(target),clusters,function(i)names(which.max(table(i))))

This tells you, for each cluster, which class is the majority. From there we can see how often the majority class of a point's cluster agrees with its actual label:

mean(Majority_class[clusters] == target)
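The same number can also be read straight off the contingency table: take the largest count in each cluster row, add them up, and divide by the total. This is often called cluster purity. Below is a minimal base R sketch; cluster_purity is a hypothetical helper name I made up for illustration, not a function from any package.

```r
# Hypothetical helper: fraction of points whose label matches the
# majority label of the cluster they were assigned to ("purity")
cluster_purity <- function(clusters, target) {
  tab <- table(clusters, target)       # rows: clusters, columns: labels
  sum(apply(tab, 1, max)) / sum(tab)   # majority count per cluster / total
}

# Same setup as the iris example: setosa vs. others, single link, k = 2
data(iris)
target <- ifelse(iris$Species == "setosa", "setosa", "others")
hc_single <- hclust(dist(iris[, 1:4]), method = "single")
clusters <- cutree(hc_single, k = 2)

purity <- cluster_purity(clusters, target)
print(purity)
```

For your own data you would pass your clusters and target vectors in the same way. Note that purity can never fall below the share of the largest class, so compare it against that baseline rather than against zero.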

Upvotes: 1
