Reputation: 95
I'm new to this site as well as new to cluster analysis, so I apologize if I violate conventions.
I've been using Cluster 3.0 to perform hierarchical cluster analysis with Euclidean distance and average linkage. Cluster 3.0 outputs a .gtr file in which each line describes a node joining two items, along with their similarity score. I've noticed that the first line in the .gtr file always links one gene with another gene, followed by the similarity score. But how do I reproduce this similarity score?
In my data set, I have 8 genes, and I create a distance matrix where d_{ij} contains the Euclidean distance between gene i and gene j. Then I normalize the matrix by dividing each element by the maximum value in the matrix. To get the similarity matrix, I subtract each element from 1. However, my result does not use the linkage type and differs from the output similarity score.
I am mainly confused about how the linkage affects the similarity of the first node (the joining of the two closest genes) and how to compute the similarity score.
Thank you!
Upvotes: 0
Views: 506
Reputation: 2123
The algorithm compares clusters, not individual data points, using some linkage method. However, in the first iteration each data point forms its own cluster, so the linkage method reduces to the metric you use to measure the distance between data points (Euclidean distance in your case). In subsequent iterations, the distance between clusters is measured according to your linkage method, which in your case is average linkage. For two clusters A and B, this is calculated as follows:
$$d(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b)$$
where $d(a,b)$ is the Euclidean distance between the two data points $a$ and $b$. Convince yourself that when A and B each contain just one data point (as in the first iteration), this equation reduces to $d(a,b)$. I hope this makes things a bit clearer. If not, please provide more details of what exactly you want to do.
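To make the point concrete, here is a minimal sketch in plain Python. The gene names and 2-D profiles are made up for illustration, and the helper functions `euclidean` and `average_linkage` are my own, not part of Cluster 3.0. It only covers the distance side; how Cluster 3.0 turns that distance into the similarity score written to the .gtr file is not something this sketch tries to reproduce.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(A, B):
    """Mean of all pairwise Euclidean distances between clusters A and B."""
    return sum(euclidean(a, b) for a in A for b in B) / (len(A) * len(B))


# Hypothetical expression profiles (illustrative values only).
genes = {
    "g1": (0.0, 0.0),
    "g2": (3.0, 4.0),
    "g3": (6.0, 8.0),
    "g4": (0.0, 1.0),
}

# First iteration: every gene is a singleton cluster, so the average-linkage
# distance between two clusters is just the distance between the two genes.
first_merge = min(
    combinations(genes, 2),
    key=lambda pair: average_linkage([genes[pair[0]]], [genes[pair[1]]]),
)
first_distance = euclidean(genes[first_merge[0]], genes[first_merge[1]])

# A later iteration: the merged cluster {g1, g4} against the singleton {g2}.
# Here the linkage matters, because two pairwise distances are averaged.
merged_to_g2 = average_linkage([genes["g1"], genes["g4"]], [genes["g2"]])

print(first_merge, first_distance, merged_to_g2)
```

With singleton clusters the linkage choice makes no difference, so the first merge is simply the pair with the smallest Euclidean distance. The linkage only starts to matter once a cluster holds more than one gene, as in the last computation.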
Upvotes: 1