Reputation: 457
I'm trying to extract a classification from a dendrogram in R that I've cut at a certain height. This is easy to do with cutree on an hclust object, but I can't figure out how to do it on a dendrogram object.
Further, I can't just use my clusters from the original hclust, because (frustratingly) the numbering of the classes from cutree is different from the numbering of classes with cut.
hc <- hclust(dist(USArrests), "ave")
classification<-cutree(hc,h=70)
dend1 <- as.dendrogram(hc)
dend2 <- cut(dend1, h = 70)
str(dend2$lower[[1]]) #group 1 here is not the same as
classification[classification==1] #group 1 here
Is there a way to either get the classifications to map to each other, or alternatively to extract lower branch memberships from the dendrogram object (perhaps with some clever use of dendrapply?) in a format more like what cutree gives?
Upvotes: 14
Views: 9324
Reputation: 1988
The accepted answer, which uses the dendextend package's extension of the base cutree function to dendrogram objects, is a fast and easy way to "get the classifications to map to each other" as OP requested.
Here is how to take the alternative OP suggested and "extract lower branch memberships from the dendrogram object". I needed to get this approach to work because I was using the dend2$lower sub-dendrograms in my analysis, so I needed my membership key to match the cut() indexing specifically.
# Use the dendextend package for necessity and the data.table package for convenience
library(data.table)
library(dendextend)
# start with OP's dend2 object created using cut(dend1)
# pull out the "labels" under each cluster using get_nodes_attr from the dendextend package
clust.key <- lapply(X = dend2$lower, FUN = get_nodes_attr, attribute = "label", include_branches = F)
# reformat into a table of cluster membership using rbindlist from the data.table package
clust.key <- rbindlist(lapply(X = 1:length(clust.key), FUN = function(e){data.table(cluster = e, label = clust.key[[e]])}))
# remove NA's that correspond to internal node labels instead of leaf labels (this is data.table syntax)
clust.key <- clust.key[!is.na(label)]
> head(clust.key)
   cluster          label
     <int>         <char>
1:       1        Florida
2:       1 North Carolina
3:       2     California
4:       2       Maryland
5:       2        Arizona
6:       2     New Mexico
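For readers who would rather avoid the extra packages, here is a minimal base-R sketch of the same membership key. It assumes that labels() on each element of the $lower list returns that branch's leaf labels, and it rebuilds OP's cut from USArrests so that it is self-contained. The names lower, leaf_labels and clust_key_base are only illustrative.

```r
# Rebuild OP's cut dendrogram (base R only)
hc <- hclust(dist(USArrests), "ave")
lower <- cut(as.dendrogram(hc), h = 70)$lower

# Leaf labels under each lower branch, named by the cut() index
leaf_labels <- lapply(lower, labels)
names(leaf_labels) <- seq_along(leaf_labels)

# stack() turns the named list into a two-column data frame (values, ind)
clust_key_base <- stack(leaf_labels)
names(clust_key_base) <- c("label", "cluster")
clust_key_base$cluster <- as.integer(as.character(clust_key_base$cluster))

head(clust_key_base)
```

The cluster column uses the same indexing as dend2$lower, so clust_key_base$label[clust_key_base$cluster == 2] should list the leaves of dend2$lower[[2]].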
I would also note that with this approach you can still use dendextend's nice plotting to color clusters via the branches_attr_by_labels() function, which also allows more complexity, like coloring only a subset of clusters. For example:
branches_attr_by_labels(dend = dend1, labels = clust.key[cluster == 2, label], TF_values = "blue", attr = "col") %>%
branches_attr_by_labels(labels = clust.key[cluster == 4, label], TF_values = "orange", attr = "col") %>%
plot()
Upvotes: 0
Reputation: 1850
Once you make your dendrogram, use the cutree method and then convert the result to a dataframe. The following code makes a nice dendrogram using the library dendextend:
library(dendextend)
# set the number of clusters
clust_k <- 8
# make the hierarchical clustering (mat is your own numeric data matrix)
par(mar = c(2.5, 0.5, 1.0, 7))
d <- dist(mat, method = "euclidean")
hc <- hclust(d)
dend <- as.dendrogram(hc)
labels_cex(dend) <- .65
dend %>%
color_branches(k=clust_k) %>%
color_labels() %>%
highlight_branches_lwd(3) %>%
plot(horiz=TRUE, main = "Branch (Distribution) Clusters by Heloc Attributes", axes = T)
Based on the coloring scheme, it looks like the clusters are formed around a height threshold of 4. So to get the assignments into a dataframe, we need to get the clusters and then unlist() them.
First you need to get the clusters themselves. The result is just a single named vector: the values are the cluster numbers and the names are the actual labels.
# creates a single named integer vector of the clusters
# (pass only k: if both k and h are given, k is used and h is ignored)
myclusters <- cutree(dend, k = clust_k)
# make the dataframe of two columns, cluster number and label
clusterDF <- data.frame(Cluster = as.numeric(unlist(myclusters)),
                        Branch = names(myclusters))
# sort by cluster ascending (arrange() comes from dplyr)
library(dplyr)
clusterDFSort <- clusterDF %>% arrange(Cluster)
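Since mat is not defined above, here is a small self-contained sketch of the same conversion on the built-in USArrests data, using only base R. The choice of 4 clusters and the name cluster_df are arbitrary assumptions for illustration.

```r
# Cluster a built-in dataset and cut into an (arbitrary) number of groups
hc <- hclust(dist(USArrests), "ave")
clusters <- cutree(hc, k = 4)  # named integer vector: names are the row labels

# Two-column data frame: cluster number and label
cluster_df <- data.frame(Cluster = unname(clusters),
                         Branch = names(clusters),
                         stringsAsFactors = FALSE)

# Sort by cluster, ascending (base R equivalent of dplyr::arrange)
cluster_df <- cluster_df[order(cluster_df$Cluster), ]
head(cluster_df)
```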
Upvotes: 0
Reputation: 25306
I would suggest using the cutree function from the dendextend package. It includes a dendrogram method (i.e.: dendextend:::cutree.dendrogram).
You can learn more about the package from its introductory vignette.
I should add that while your function (classify) is good, there are several advantages to using cutree from dendextend:
It also allows you to use a specific k (number of clusters), and not just h (a specific height).
It is consistent with the result you would get from cutree on hclust (classify will not be).
It will often be faster.
Here are examples for using the code:
# Toy data:
hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)
# Get the package:
install.packages("dendextend")
library(dendextend)
# Cut the tree:
cutree(dend1,h=70) # it now works on a dendrogram
# It is like using:
dendextend:::cutree.dendrogram(dend1,h=70)
By the way, on the basis of this function, dendextend allows the user to do more cool things, like color branches/labels based on cutting the dendrogram:
dend1 <- color_branches(dend1, k = 4)
dend1 <- color_labels(dend1, k = 5)
plot(dend1)
Lastly, here is some more code for demonstrating my other points:
# This would also work with k:
cutree(dend1,k=4)
# and would give identical result as cutree on hclust:
identical(cutree(hc,h=70) , cutree(dend1,h=70) )
# TRUE
# But this is not the case for classify:
identical(classify(dend1,70) , cutree(dend1,h=70) )
# FALSE
install.packages("microbenchmark")
require(microbenchmark)
microbenchmark(classify = classify(dend1,70),
cutree = cutree(dend1,h=70) )
# Unit: milliseconds
# expr min lq median uq max neval
# classify 9.70135 9.94604 10.25400 10.87552 80.82032 100
# cutree 37.24264 37.97642 39.23095 43.21233 141.13880 100
# 4 times faster for this tree (it will be more for larger trees)
# To be exact, though: if I force cutree.dendrogram not to go through hclust (which can happen for "weird" trees), the speed is similar to classify:
microbenchmark(classify = classify(dend1,70),
cutree = cutree(dend1,h=70, try_cutree_hclust = FALSE) )
# Unit: milliseconds
# expr min lq median uq max neval
# classify 9.683433 9.819776 9.972077 10.48497 29.73285 100
# cutree 10.275839 10.419181 10.540126 10.66863 16.54034 100
If you are thinking of ways to improve this function, please patch it through here:
https://github.com/talgalili/dendextend/blob/master/R/cutree.dendrogram.R
I hope you, or others, will find this answer helpful.
Upvotes: 17
Reputation: 457
I ended up creating a function to do it using dendrapply. It's not elegant, but it works:
classify <- function(dendrogram, height) {
  # mini-function to use with dendrapply to return tip labels
  members <- function(n) {
    labels <- c()
    if (is.leaf(n)) {
      a <- attributes(n)
      labels <- c(labels, a$label)
    }
    labels
  }
  dend2 <- cut(dendrogram, height) # the cut dendrogram object
  branchesvector <- c()
  membersvector <- c()
  for (i in seq_along(dend2$lower)) { # for each lower tree resulting from the cut
    memlist <- unlist(dendrapply(dend2$lower[[i]], members)) # get the tip labels
    branchesvector <- c(branchesvector, rep(i, length(memlist))) # add the lower tree identifier to a vector
    membersvector <- c(membersvector, memlist) # add the tip labels to a vector
  }
  out <- as.integer(branchesvector) # make the output a named integer vector, to match cutree() output
  names(out) <- membersvector
  out
}
Using the function makes it clear what the problem is: cut numbers the branches left to right in the dendrogram, while cutree numbers the clusters in the order of the observations in the original data (which is alphabetical for USArrests).
hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)
classify(dend1,70) #Florida 1, North Carolina 1, etc.
cutree(hc,h=70) #Alabama 1, Arizona 1, Arkansas 1, etc.
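Because both approaches cut the same tree at the same height, the two numberings should differ only by a relabelling. Here is a rough base-R sketch of building that crosswalk. It assumes labels() returns the leaf labels of each lower branch, and the names cut_id, cutree_id, crosswalk and cut_to_cutree are just illustrative.

```r
hc <- hclust(dist(USArrests), "ave")
lower <- cut(as.dendrogram(hc), h = 70)$lower

# cut() numbering: index of the lower branch that each leaf falls under
cut_id <- unlist(lapply(seq_along(lower), function(i) {
  labs <- labels(lower[[i]])
  setNames(rep(i, length(labs)), labs)
}))

# cutree() numbering, and cut_id reordered to match it
cutree_id <- cutree(hc, h = 70)
cut_id <- cut_id[names(cutree_id)]

# Cross-tabulate: each row should have exactly one non-zero cell
crosswalk <- table(cutree = cutree_id, cut = cut_id)
print(crosswalk)

# Lookup vector: element i is the cutree number of cut() branch i
cut_to_cutree <- unname(cutree_id[match(seq_along(lower), cut_id)])

# Relabel the cut() numbering so it matches cutree()
relabelled <- setNames(cut_to_cutree[cut_id], names(cut_id))
```

With the lookup vector in hand, dend2$lower[[i]] corresponds to cluster cut_to_cutree[i] in the cutree output.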
Upvotes: 8