Reputation: 33
For my undergraduate research project, I'm looking for R code for agglomerative clustering. Basically, I need to know what happens inside the hclust function in R. I have looked everywhere but cannot find a proper way to merge the two data points that have the smallest distance. I am stuck at updating the dissimilarity matrix after the first merge (I have already generated the initial dissimilarity matrix). I'm not restricted to R; if anyone could give me a solution in any language I would gratefully accept it.
Upvotes: 1
Views: 123
Reputation: 37641
When you use hclust in R, all of the information that you need is stored in the output of the function. You can just save the output and read off what you need. However, the output may not be completely transparent, so I will go through a very small example and you can apply the ideas to your data.
To get a small example, I will randomly sample eight points from the built-in iris data and then use hclust on those eight points. Since I don't specify anything different, hclust uses the default complete-link clustering.
set.seed(2021)                        # make the random sample reproducible
S1 = sort(sample(150, 8))             # pick 8 of the 150 rows of iris
Tree1 = hclust(dist(iris[S1,1:4]))    # complete-link clustering of the 4 numeric columns
plot(Tree1, hang=-1)                  # dendrogram with all labels hanging at the same level
As I said, the output of hclust contains what you need. It is stored in Tree1. Actually, it contains more than I think you are asking about. You can see everything in there by running str(Tree1), but for now I am going to concentrate on two parts of that structure: Tree1$merge and Tree1$height.
Tree1$merge
     [,1] [,2]
[1,]   -4   -8
[2,]   -5    1
[3,]   -3   -7
[4,]   -1   -2
[5,]   -6    2
[6,]    3    4
[7,]    5    6
What do these mean? First, Tree1$merge tells us the order in which clusters were merged. The first row, [1,] -4 -8, tells us that the first step is to merge points 4 (labeled 103) and 8 (labeled 140) to form cluster 1. (The points get negative numbers; the clusters formed from them get positive numbers.) The next row, [2,] -5 1, tells us that the second step merges point 5 (with label 105) with cluster 1 that we formed above from points 4 and 8. The remaining rows show the remaining merge steps. For example, the last step merges clusters 5 and 6.
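If you want to turn this encoding into explicit memberships, you can walk the merge matrix yourself. Here is a small sketch (the helper members is my own name, not something hclust provides):

# which original points end up in the cluster formed at merge step i?
members = function(merge, i) {
  unlist(lapply(merge[i, ], function(x) if (x < 0) -x else members(merge, x)))
}
for (i in 1:nrow(Tree1$merge)) {
  cat("step", i, ": points", members(Tree1$merge, i), "\n")
}

For the tree above, step 1 should print points 4 and 8, and step 2 should print 5, 4, 8, matching the rows of Tree1$merge.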
OK, so now we see the order in which the clusters were merged, but what were the distances? Why this order? We get that from Tree1$height, although of course that came from the original distance matrix. So let's also look at the distance matrix.
Tree1$height
[1] 0.5477226 0.6164414 0.7745967 0.9848858 1.0148892 1.8000000 3.2511536
dist(iris[S1,1:4])
           69        70       102       103       105       110       135
70  0.9848858
102 0.9643651 1.4696938
103 1.9416488 2.7386128 1.5684387
105 1.7058722 2.4248711 1.0770330 0.6164414
110 2.5534291 3.2511536 2.0322401 0.7549834 1.0148892
135 1.1789826 1.8000000 0.7745967 1.3190906 1.0000000 1.9157244
140 1.5716234 2.3021729 1.2247449 0.5477226 0.5830952 0.9949874 1.1916375
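Before reading off the merge order, it can help to locate the smallest entry of this matrix programmatically. A minimal sketch, reusing S1 from above:

D = as.matrix(dist(iris[S1,1:4]))    # square form of the distance matrix
diag(D) = Inf                        # ignore the zero self-distances
min(D)                               # smallest pairwise distance
which(D == min(D), arr.ind = TRUE)   # which pair of points attains it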
Notice that the first entry in Tree1$height is 0.5477226, the distance between the fourth point (103) and the eighth point (140). This was the minimum distance in the distance matrix, which is why those points were merged first. Recall that the next merge was point 5 (105) with cluster 1. When we use complete link, how is that distance computed? It is the greatest distance between point 5 and any point in cluster 1. We can see from the distance matrix that the distance between points 4 and 5 is 0.6164414 and the distance between points 8 and 5 is 0.5830952, so the distance between point 5 and cluster 1 is 0.6164414 (the larger value, since complete link takes the maximum). Looking at the rest of the distance matrix, we can see that no other pair of unmerged points is closer than this, so the second merge is point 5 with cluster 1 at a distance of 0.6164414. The third row of Tree1$merge tells us that at the third step we merge points 3 and 7 (labels 102 and 135), and the third entry of Tree1$height tells us that the distance for this merger was 0.7745967.
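Finally, since the original question was how to do this update yourself, here is a rough sketch of naive complete-link agglomerative clustering working directly on a square distance matrix. This is not how hclust is implemented internally (it uses more efficient compiled code); it is only meant to show the merge-and-update loop, and all of the names (naive_complete_link, D, clusters) are mine:

# Naive complete-link agglomerative clustering on a distance object or matrix.
# Each pass finds the closest pair of current clusters, merges them, and
# rewrites the affected row/column of the matrix.
naive_complete_link = function(D) {
  D = as.matrix(D)
  diag(D) = Inf                          # never merge a cluster with itself
  clusters = as.list(seq_len(nrow(D)))   # start with every point in its own cluster
  merges = list()
  while (length(clusters) > 1) {
    # 1. find the pair of clusters with the smallest dissimilarity
    ij = which(D == min(D), arr.ind = TRUE)[1, ]
    i = min(ij); j = max(ij)
    merges[[length(merges) + 1]] = list(members = c(clusters[[i]], clusters[[j]]),
                                        height  = D[i, j])
    # 2. merge cluster j into cluster i
    clusters[[i]] = c(clusters[[i]], clusters[[j]])
    clusters[[j]] = NULL
    # 3. update the dissimilarity matrix: for complete link, the distance from
    #    the merged cluster to any other cluster is the maximum of the two old
    #    distances (for single link you would take pmin instead)
    D[i, -j] = pmax(D[i, -j], D[j, -j])
    D[-j, i] = D[i, -j]
    D = D[-j, -j, drop = FALSE]
    diag(D) = Inf
  }
  merges
}

steps = naive_complete_link(dist(iris[S1,1:4]))
sapply(steps, function(s) s$height)    # should match Tree1$height

On the eight sampled points this should reproduce the merge heights shown in Tree1$height above (ties aside), and the members recorded at each step correspond to the rows of Tree1$merge, up to the sign/index encoding described earlier.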
Upvotes: 1