kaitokid

Reputation: 57

How to obtain the height of tree in cutree() knowing the number of clusters

I am using hierarchical clustering to classify my data.

I would like to determine the optimal number of clusters. The idea is to plot a graph whose x-axis is the number of clusters and whose y-axis is the height of the tree in the dendrogram.

To do so, I need to know the height of the tree for a given number of clusters K. For example, if K = 4, I need the height of the tree after the command

cutree(hclust(dist(data), method = "ward.D"), k = 4) 

Can someone help, please?

Upvotes: 1

Views: 2991

Answers (2)

Yun

Reputation: 195

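  # Return a cut height that yields k clusters: the midpoint between the
  # merge that leaves k clusters and the next merge up the tree.
  # Requires the cli package for the error messages.
  cutree_k_to_h <- function(tree, k) {
    if (is.null(n1 <- nrow(tree$merge)) || n1 < 1) {
      cli::cli_abort("invalid {.arg tree} ({.field merge} component)")
    }
    n <- n1 + 1  # number of observations
    if (is.unsorted(tree$height)) {
      cli::cli_abort(
        "the 'height' component of 'tree' is not sorted (increasingly)"
      )
    }
    # height[n - k + 1] does not exist for k = 1, nor height[n - k] for k = n
    if (k < 2 || k > n - 1) {
      cli::cli_abort("{.arg k} must be between 2 and {n - 1}")
    }
    mean(tree$height[c(n - k, n - k + 1L)])
  }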
  tree <- hclust(dist(iris[, 1:4]), method = "ward.D")
  plot(tree)
  abline(h = cutree_k_to_h(tree, 3), col = "red")

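As a quick sanity check, the sketch below compares a cut at the midpoint height with a cut by cluster count, using only base R. The variable names `by_k` and `by_h` are illustrative, and it assumes the merge heights are sorted increasingly (which `cutree()` requires when `h` is given).

```r
# Sketch: a midpoint cut height should reproduce cutree(k = ...)
tree <- hclust(dist(iris[, 1:4]), method = "ward.D")
n <- nrow(tree$merge) + 1   # number of observations
k <- 4                      # desired number of clusters (2 <= k <= n - 1)

# Any h with height[n - k] <= h < height[n - k + 1] yields k clusters;
# take the midpoint of that interval.
h <- mean(tree$height[c(n - k, n - k + 1)])

by_k <- cutree(tree, k = k)  # cut by number of clusters
by_h <- cutree(tree, h = h)  # cut by height

# Both cuts should assign every observation to the same cluster
all(by_k == by_h)
```

If the two assignments agree, the midpoint height is a valid place to draw the cut line for that `k`.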

Upvotes: 0

G5W

Reputation: 37641

The heights are stored in the hclust object. Since you do not provide any data, I will illustrate with the built-in iris data.

HC = hclust(dist(iris[,1:4]), method="ward.D")
sort(HC$height)
<reduced output>
[133]   1.8215623   1.8787489   1.9240172   1.9508686   2.5143038   2.7244855
[139]   2.9123706   3.1111893   3.2054610   3.9028695   4.9516315   6.1980126
[145]   9.0114060  10.7530460  18.2425079  44.1751473 199.6204659

The largest value is the height of the first split, the second largest is the height of the second split, and so on. You can confirm that these are the heights you need by plotting.

plot(HC)
abline(h=10.75,col="red")

Dendrogram

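You can see that the fourth-largest height (10.75) matches the height of the fourth split. Cutting the tree anywhere from that height up to, but not including, the third-largest height (18.24) gives four clusters.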
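Building on this, the graph described in the question (number of clusters against tree height) could be sketched as follows. The range `1:10` and the name `height_k` are arbitrary choices for illustration; here the height for k clusters is taken to be the k-th largest merge height, that is, the lowest cut height that still yields k clusters.

```r
# Sketch: plot tree height against number of clusters
HC <- hclust(dist(iris[, 1:4]), method = "ward.D")

k <- 1:10                      # candidate numbers of clusters (arbitrary range)
height_k <- rev(HC$height)[k]  # k-th largest merge height

plot(k, height_k, type = "b",
     xlab = "Number of clusters", ylab = "Height")
```

A sharp drop followed by a flat stretch in this curve is one common visual cue for choosing the number of clusters, though it is a heuristic rather than a definitive rule.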

Upvotes: 2
