davnovak
davnovak

Reputation: 57

Creating dendrograms manually: how to fix "'merge' matrix has invalid contents" in plot.hclust?

I am creating an hclust object manually (i.e. creating a list with the required slots, then changing its class to hclust). The merging pattern, heights of bifurcations, ordering of leaf nodes and labels of leaf nodes are known. My goal (and means of testing) is to plot the resulting dendrogram. I am unable to create a plottable hclust object with my parameters.

The components of an hclust object are described in the hclust function documentation here (see section Value).

The following is a reproducible chunk of R code I use to generate and plot my dendrogram.

tree <- list()
tree$merge <- matrix(c( -1,  -7,  # row  1
                        -2,  -6,  # row  2
                        -3, -12,  # row  3
                        -4, -14,  # row  4
                        -5,  -8,  # row  5
                        -9, -11,  # row  6
                       -13, -20,  # row  7
                       -15, -19,  # row  8
                         1,   8,  # row  9
                         2,   5,  # row 10
                         3,   6,  # row 11
                         2, -18,  # row 12
                         1,   3,  # row 13
                         2,   4,  # row 14
                       -10,   7,  # row 15
                       -16, -17,  # row 16
                         1,   2,  # row 17
                        15,  16,  # row 18
                         1,  15), # row 19
                     ncol = 2,
                     byrow = TRUE)
tree$height <- c(0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.11167131, 0.11167131, 0.11167131, 0.12832304, 0.17304035, 0.17304035, 0.17304035, 0.17304035, 0.22965349, 0.22965349, 0.23334799)
tree$labels <- as.character(1:20)
tree$order <- c(1, 7, 15, 19, 3, 12, 9, 11, 2, 6, 5, 8, 18, 4, 14, 13, 20, 10, 16, 17)
class(tree) <- "hclust"
plot(tree)

Each row of the tree$merge matrix corresponds to a bifurcation. Negative integers refer to the indices of a leaf nodes, whereas positive integers refer to existing clusters by row indices in tree$merge.

Running the code results in the following error message.

Error in plot.hclust(tree) : 'merge' matrix has invalid contents

A sketch of the intended result is pictured below, with the heights values marked by additional dotted lines. (The drawing is not to scale.)

Intended dendrogram

Upvotes: 1

Views: 1074

Answers (1)

slava-kohut
slava-kohut

Reputation: 4233

The validity of a hclust tree is checked by the .validity.hclust function. Its source code is given here. Look at lines 121-135.

That you got the error means that your tree is not valid because of its merge matrix. It has non-unique elements (e.g., 1 and 2). In a properly constructed merge matrix, all entries are unique and run from -N_obs to N_obs-2 (zero excluded), where N_obs is a (positive) number of observations. This is checked by the following if test in the code:

if(identical(sort(as.integer(merge)), c(-(n:1L), +seq_len(n-2L))))
    TRUE
else
    "'merge' matrix has invalid contents"

From the reference of hclust:

merge an n − 1 by 2 matrix.

Row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then observation − j was merged at this stage. If j is positive then the merge was with the cluster formed at the (earlier) stage j of the algorithm. Thus negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.

All negative entries are singletons (observations), and positive numbers are merges of existing clusters and refer to merging steps of the algorithm.

So, revise your hclust object. Here is some code to give you an idea what a proper hclust object looks like:

iris2 <- iris[1:20,-5]
species_labels <- iris[,5]
d_iris <- dist(iris2)
tree_iris <- hclust(d_iris, method = "complete")

Take a closer look at tree_iris$merge.

UPDATE

After I got more time, I decided to fix your code. I modified the merge entry of the tree. This is what the working code that reproduces your dendrogram looks like:

tree <- list()
tree$merge <- matrix(c( -1,  -7,  # row  1
                        -2,  -6,  # row  2
                        -3, -12,  # row  3
                        -4, -14,  # row  4
                        -5,  -8,  # row  5
                        -9, -11,  # row  6
                        -13, -20,  # row  7
                        -15, -19,  # row  8
                        1,   8,  # row  9: 1,7,15,19
                        2,   5,  # row 10: 2,6,5,8
                        3,   6,  # row 11: 3,12,9,11
                        10, -18,  # row 12: 2,6,5,8 + 18
                        9,   11,  # row 13:  1,7,15,19 + 3,12,9,11
                        12,   4,  # row 14: row 12 + row 4
                        -10,   7,  # row 15: row 7 + 10
                        -16, -17,  # row 16
                        13,   14,  # row 17: row 13 + row 14 
                        15,  16,  # row 18: row 15 + row 16
                        17,  18), # row 19: row 17 + row 18
                     ncol = 2,
                     byrow = TRUE)
tree$height <- c(0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.06573653, 0.11167131, 0.11167131, 0.11167131, 0.12832304, 0.17304035, 0.17304035, 0.17304035, 0.17304035, 0.22965349, 0.22965349, 0.23334799)
tree$labels <- as.character(1:20)
tree$order <- c(1, 7, 15, 19, 3, 12, 9, 11, 2, 6, 5, 8, 18, 4, 14, 13, 20, 10, 16, 17)
class(tree) <- "hclust"
plot(tree)

Upvotes: 2

Related Questions