user1626688

Reputation: 1653

Create dendrogram using an index (proportion data) as grouping variable, R

I want to build a dendrogram from an index (proportion data) so that individuals with similar index values cluster together. I am trying to decide which distance/similarity metric to use so that the clustering reflects the original index values.

I have a data frame that looks like this:

data <- read.table(text="ind  index
T1   0.10
T2   0.11
T3   0.01
T4   0.64
T5   0.03
T6   0.15
T7   0.26
T8   0.06
T9   0.01
T10  0.004
T11  0.01
T12  0.19
T13  0.04
T14  0.69
T15  0.06
T16  0.51
T17  0.15
T18  0.26
T19  0.26
T20  0.01
", header=TRUE)

head(data)

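# keep the index column as a one-column matrix, labelled by individual
data2 <- as.matrix(data[, 2])
rownames(data2) <- as.character(data$ind)

# Euclidean distance; with a single column this is |index_i - index_j|
d <- dist(data2)

# hierarchical clustering (hclust defaults to method = "complete")
hc <- hclust(d)
# very simple dendrogram
plot(hc)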

This produces a simple dendrogram. However, what I actually want is for the values in the index column to serve as "my distance". Any suggestions are welcome. Thanks in advance!

Upvotes: 0

Views: 558

Answers (2)

Tal Galili

Reputation: 25376

You can use the cophenetic() function to extract the distance matrix implied by an hclust object. With that, you can check how well the dendrogram represents your original distances by computing the correlation between the original distance matrix and the cophenetic distance from the dendrogram. For example:

> hc <- hclust(d, method="single")
> cor(d, cophenetic(hc))
[1] 0.9270891
> hc <- hclust(d, method="complete")
> cor(d, cophenetic(hc))
[1] 0.9249611

This tells you that the "single" method is a tiny bit better than "complete", but that neither of the two is able to fully capture the original distance matrix (since the correlation is not 1).
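The same check can be run over several linkage methods at once. The following is only a sketch: the index values are copied from the question, while the choice of methods ("average" and "ward.D2" in addition to the two above) and the variable names `index`, `methods` and `coph_cor` are my own assumptions for illustration.

```r
# Index values from the question, keyed by individual.
index <- c(T1 = 0.10, T2 = 0.11, T3 = 0.01, T4 = 0.64, T5 = 0.03,
           T6 = 0.15, T7 = 0.26, T8 = 0.06, T9 = 0.01, T10 = 0.004,
           T11 = 0.01, T12 = 0.19, T13 = 0.04, T14 = 0.69, T15 = 0.06,
           T16 = 0.51, T17 = 0.15, T18 = 0.26, T19 = 0.26, T20 = 0.01)

# With a single variable, Euclidean distance is just |index_i - index_j|.
d <- dist(index)

# Linkage methods to compare (an illustrative selection, not exhaustive).
methods <- c("single", "complete", "average", "ward.D2")

# Cophenetic correlation for each method: closer to 1 means the tree
# reproduces the original distances more faithfully.
coph_cor <- sapply(methods, function(m) {
  cor(d, cophenetic(hclust(d, method = m)))
})

print(sort(coph_cor, decreasing = TRUE))
```

Whichever method comes out on top here, it is worth looking at the resulting tree as well, since a high cophenetic correlation alone does not guarantee clusters that are easy to interpret.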

I hope this helps.

Upvotes: 1

lawyeR

Reputation: 7674

Perhaps this will help? Your values are on the y-axis.

hc <- hclust(d = d, method="single", members=NULL)
library(ggdendro)
ggdendrogram(hc, theme_dendro=FALSE)
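If the goal is to see the index values themselves on the plot, one option is to put them in the leaf labels. This is a minimal sketch using base R's plot method for hclust rather than ggdendro, so it needs no extra package; the label format and the names `leaf_labels` and `data` as rebuilt here are assumptions for illustration.

```r
# Rebuild the question's data so the example is self-contained.
data <- data.frame(
  ind   = paste0("T", 1:20),
  index = c(0.10, 0.11, 0.01, 0.64, 0.03, 0.15, 0.26, 0.06, 0.01, 0.004,
            0.01, 0.19, 0.04, 0.69, 0.06, 0.51, 0.15, 0.26, 0.26, 0.01)
)

# One variable, so the distance between two individuals is |index_i - index_j|.
d  <- dist(data$index)
hc <- hclust(d, method = "single")

# Label each leaf with its identifier and its index value, e.g. "T4 (0.640)".
leaf_labels <- sprintf("%s (%.3f)", data$ind, data$index)

# hang = -1 aligns all leaves at the bottom of the plot.
plot(hc, labels = leaf_labels, hang = -1,
     main = "Single-linkage dendrogram of index values",
     xlab = "", sub = "", ylab = "Distance between clusters")
```

Note that the vertical axis of a dendrogram shows the distance at which clusters are merged, not the raw index values, which is why the values are placed in the labels here.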


Upvotes: 1
