Reputation: 33
For my undergraduate research project, I'm looking for R code for agglomerative clustering. Basically, I need to know what happens inside the hclust function in R. I have looked everywhere but cannot find a proper way to merge the two data points that have the smallest distance. I am stuck at updating the dissimilarity matrix after the first merge (I have already generated the initial dissimilarity matrix). I'm not restricted to R; if anyone could give me a solution in any language I would gratefully accept it.
Upvotes: 1
Views: 123
Reputation: 37641
When you use hclust in R, all of the information that you need is stored in the output of the function. You can just save the output and read off what you need. However, the output may not be completely transparent, so I will go through a very small example and you can apply the ideas to your data.
To get a small example, I will randomly sample eight points from the built-in iris data and then use hclust on those eight points. Since I don't specify anything different, hclust uses the default complete-link clustering.
set.seed(2021)                        # make the random sample reproducible
S1 = sort(sample(150, 8))             # pick 8 of the 150 rows of iris
Tree1 = hclust(dist(iris[S1,1:4]))    # complete-link clustering of the 4 numeric columns
plot(Tree1, hang=-1)                  # dendrogram with all labels hanging at the same level
As I said, the output of hclust contains what you need. It is stored in Tree1. Actually, it contains more than I think you are asking about. You can see everything in there by running str(Tree1), but for now I am going to concentrate on two parts of that structure: Tree1$merge and Tree1$height.
Tree1$merge
     [,1] [,2]
[1,]   -4   -8
[2,]   -5    1
[3,]   -3   -7
[4,]   -1   -2
[5,]   -6    2
[6,]    3    4
[7,]    5    6
What do these mean? First, Tree1$merge tells us the order in which clusters were merged. The first row, [1,] -4 -8, tells us that the first step is to merge points 4 (labeled 103) and 8 (labeled 140) to form cluster 1. (The points get negative numbers; the clusters formed from them get positive numbers.) The next row, [2,] -5 1, tells us that the second step merges point 5 (with label 105) with cluster 1 that we formed above from points 4 and 8. The remaining rows show the remaining merge steps. For example, the last step merges clusters 5 and 6.
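If you want to turn this encoding into explicit memberships, you can walk the merge matrix yourself. Here is a small sketch (the helper members is my own name, not something hclust provides):

# which original points end up in the cluster formed at merge step i?
members = function(merge, i) {
  unlist(lapply(merge[i, ], function(x) if (x < 0) -x else members(merge, x)))
}
for (i in 1:nrow(Tree1$merge)) {
  cat("step", i, ": points", members(Tree1$merge, i), "\n")
}

For the tree above, step 1 should print points 4 and 8, and step 2 should print 5, 4, 8, matching the rows of Tree1$merge.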
OK, so now we see the order in which the clusters were merged, but what were the distances? Why this order? We get that from Tree1$height, although of course that came from the original distance matrix. So let's also look at the distance matrix.
Tree1$height
[1] 0.5477226 0.6164414 0.7745967 0.9848858 1.0148892 1.8000000 3.2511536
dist(iris[S1,1:4])
           69        70       102       103       105       110       135
70  0.9848858
102 0.9643651 1.4696938
103 1.9416488 2.7386128 1.5684387
105 1.7058722 2.4248711 1.0770330 0.6164414
110 2.5534291 3.2511536 2.0322401 0.7549834 1.0148892
135 1.1789826 1.8000000 0.7745967 1.3190906 1.0000000 1.9157244
140 1.5716234 2.3021729 1.2247449 0.5477226 0.5830952 0.9949874 1.1916375
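Before reading off the merge order, it can help to locate the smallest entry of this matrix programmatically. A minimal sketch, reusing S1 from above:

D = as.matrix(dist(iris[S1,1:4]))    # square form of the distance matrix
diag(D) = Inf                        # ignore the zero self-distances
min(D)                               # smallest pairwise distance
which(D == min(D), arr.ind = TRUE)   # which pair of points attains it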
Notice that the first entry in Tree1$height is 0.5477226, the distance between the fourth point (103) and the eighth point (140). This was the minimum distance in the distance matrix, which is why those points were merged first. Recall that the next merge was point 5 (105) with cluster 1. When we use complete link, how is that distance computed? It is the greatest distance between point 5 and any point in cluster 1. We can see from the distance matrix that the distance between points 4 and 5 is 0.6164414 and the distance between points 8 and 5 is 0.5830952, so the distance between point 5 and cluster 1 is 0.6164414 (the larger value, since complete link takes the maximum). Looking at the rest of the distance matrix, we can see that no other pair of unmerged points is closer than this, so the second merge is point 5 with cluster 1 at a distance of 0.6164414. The third row of Tree1$merge tells us that at the third step we merge points 3 and 7 (labels 102 and 135), and the third entry of Tree1$height tells us that the distance for this merger was 0.7745967.
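Finally, since the original question was how to do this update yourself, here is a rough sketch of naive complete-link agglomerative clustering working directly on a square distance matrix. This is not how hclust is implemented internally (it uses more efficient compiled code); it is only meant to show the merge-and-update loop, and all of the names (naive_complete_link, D, clusters) are mine:

# Naive complete-link agglomerative clustering on a distance object or matrix.
# Each pass finds the closest pair of current clusters, merges them, and
# rewrites the affected row/column of the matrix.
naive_complete_link = function(D) {
  D = as.matrix(D)
  diag(D) = Inf                          # never merge a cluster with itself
  clusters = as.list(seq_len(nrow(D)))   # start with every point in its own cluster
  merges = list()
  while (length(clusters) > 1) {
    # 1. find the pair of clusters with the smallest dissimilarity
    ij = which(D == min(D), arr.ind = TRUE)[1, ]
    i = min(ij); j = max(ij)
    merges[[length(merges) + 1]] = list(members = c(clusters[[i]], clusters[[j]]),
                                        height  = D[i, j])
    # 2. merge cluster j into cluster i
    clusters[[i]] = c(clusters[[i]], clusters[[j]])
    clusters[[j]] = NULL
    # 3. update the dissimilarity matrix: for complete link, the distance from
    #    the merged cluster to any other cluster is the maximum of the two old
    #    distances (for single link you would take pmin instead)
    D[i, -j] = pmax(D[i, -j], D[j, -j])
    D[-j, i] = D[i, -j]
    D = D[-j, -j, drop = FALSE]
    diag(D) = Inf
  }
  merges
}

steps = naive_complete_link(dist(iris[S1,1:4]))
sapply(steps, function(s) s$height)    # should match Tree1$height

On the eight sampled points this should reproduce the merge heights shown in Tree1$height above (ties aside), and the members recorded at each step correspond to the rows of Tree1$merge, up to the sign/index encoding described earlier.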
Upvotes: 1