Reputation: 1653
I want to create a dendrogram from an index (proportion data) so that individuals with similar index values cluster together. I am trying to decide which distance/similarity metric to use so that the resulting clusters reflect the original index values.
I have a data frame that looks like this:
data<-read.table(text="ind index
T1 0.10
T2 0.11
T3 0.01
T4 0.64
T5 0.03
T6 0.15
T7 0.26
T8 0.06
T9 0.01
T10 0.004
T11 0.01
T12 0.19
T13 0.04
T14 0.69
T15 0.06
T16 0.51
T17 0.15
T18 0.26
T19 0.26
T20 0.01
",header=T)
head(data)
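# one-column matrix of index values, labelled by individual
data2 <- as.matrix(data[, 2])
rownames(data2) <- as.character(data$ind)
# Euclidean distance; with a single column this is |index_i - index_j|
d <- dist(data2)
# prepare hierarchical cluster
hc <- hclust(d)
# very simple dendrogram
plot(hc)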
This will produce a simple dendrogram. However, I actually want to use the values from the index column as "my distance". Any suggestions are welcome. Thanks in advance!
Upvotes: 0
Views: 558
Reputation: 25376
You can use the cophenetic function to extract the distance matrix implied by the hclust object. With that, you can check how well your dendrogram represents your original distances by checking the correlation between the original distances and the cophenetic distances from the dendrogram. For example:
> hc <- hclust(d, method="single")
> cor(d, cophenetic(hc))
[1] 0.9270891
> hc <- hclust(d, method="complete")
> cor(d, cophenetic(hc))
[1] 0.9249611
This tells you that the "single" method is a tiny bit better than "complete", but that neither of the two is able to fully capture the original distance matrix (since their correlations are not 1).
I hope this helps.
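The comparison above can be extended to more linkage methods in one pass. This is only a sketch: the index values are copied from the question, and the choice of the four methods and the name coph_cor are my own assumptions, not part of the original answer.

```r
# Index values copied from the question, labelled T1..T20
index <- c(0.10, 0.11, 0.01, 0.64, 0.03, 0.15, 0.26, 0.06, 0.01, 0.004,
           0.01, 0.19, 0.04, 0.69, 0.06, 0.51, 0.15, 0.26, 0.26, 0.01)
names(index) <- paste0("T", seq_along(index))

# Euclidean distance on one variable, i.e. |index_i - index_j|
d <- dist(index)

# Linkage methods to compare (an illustrative selection)
methods <- c("single", "complete", "average", "ward.D2")

# Cophenetic correlation for each method: closer to 1 means the
# dendrogram preserves the original pairwise distances better
coph_cor <- sapply(methods, function(m) {
  cor(d, cophenetic(hclust(d, method = m)))
})

print(round(sort(coph_cor, decreasing = TRUE), 3))
```

Whichever method ranks highest here distorts the original distances the least, which is one reasonable (but not the only) criterion for choosing a linkage.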
Upvotes: 1
Reputation: 7674
Perhaps this will help? The y-axis shows the height at which clusters merge, which is in the same units as your index values.
hc <- hclust(d = d, method="single", members=NULL)
library(ggdendro)
ggdendrogram(hc, theme_dendro=FALSE)
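To see how the y-axis relates to the index, here is a small base R check, assuming the same data as the question and single linkage as above. With one variable and single linkage, the merge heights should match the gaps between consecutive sorted index values; the variable names index and gaps are mine.

```r
# Index values copied from the question, labelled T1..T20
index <- c(0.10, 0.11, 0.01, 0.64, 0.03, 0.15, 0.26, 0.06, 0.01, 0.004,
           0.01, 0.19, 0.04, 0.69, 0.06, 0.51, 0.15, 0.26, 0.26, 0.01)
names(index) <- paste0("T", seq_along(index))

# Single-linkage clustering on |index_i - index_j|
hc <- hclust(dist(index), method = "single")

# Gaps between neighbouring values once the index is sorted
gaps <- diff(sort(index))

# Compare the merge heights (y-axis of the dendrogram) with those gaps
print(all.equal(sort(hc$height), sort(gaps), check.attributes = FALSE))

# Base R plot; the y-axis is labelled "Height" and is in index units
plot(hc, hang = -1, ylab = "Height (difference in index)")
```

So the dendrogram's heights are differences between index values rather than the index values themselves, which is worth keeping in mind when reading the plot.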
Upvotes: 1