user1354607

Reputation: 131

Pruning dendrogram in scipy (hierarchical clustering)

I have a distance matrix with about 5000 entries, and use scipy's hierarchical clustering methods to cluster the matrix. The code I use for this is the following snippet:

import fastcluster
import scipy.cluster.hierarchy as sch
Y = fastcluster.linkage(D, method='centroid')  # D: distance matrix
Z1 = sch.dendrogram(Y, truncate_mode='level', p=7, show_contracted=True)

Since the dendrogram becomes rather dense with this much data, I use truncate_mode to prune it a bit. All of this works, but I wonder how I can find out which of the original 5000 entries belong to a particular branch of the dendrogram.

I tried using

 leaves = sch.leaves_list(Y)

to get a list of leaves, but this takes the linkage output as input, and while I can see the correspondence between the pruned dendrogram and the leaves list, it becomes a bit cumbersome to map the original entries to the dendrogram manually.

To summarize: is there a way of listing all the original entries in the distance matrix that belong to a branch in a pruned dendrogram? Or are there other methods of doing this that I am not aware of?

Thanks

Upvotes: 13

Views: 3822

Answers (1)

Dhara

Reputation: 6767

The dictionary returned by scipy.cluster.hierarchy.dendrogram has the key ivl, which the documentation describes as:

a list of labels corresponding to the leaf nodes

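You can supply custom labels (using labels=<array of labels>) as input to the dendrogram function, but by default they are just the indices of the original observations. By comparing the original labels/indices with Z1['ivl'], you can determine what the original entries were. Note that in a truncated dendrogram, a contracted branch is labelled by default with its leaf count in parentheses, e.g. (23), rather than with an index; the cluster node id for each drawn leaf is available in Z1['leaves'].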
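One possible way to go from a drawn leaf back to the original entries is sketched below. It is not part of the answer above, only an illustration under a few assumptions: it uses scipy's own linkage on random points instead of fastcluster and the asker's matrix D, passes no_plot=True so that nothing is drawn, and introduces a hypothetical helper, members_of_truncated_leaves. The idea is that the 'leaves' entry of the returned dictionary holds the cluster node id of every drawn leaf, and sch.to_tree can list the original observations beneath each node.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist


def members_of_truncated_leaves(Y, R):
    """Pair each leaf label of a (truncated) dendrogram with the sorted
    indices of the original observations that fall under that leaf.

    Y is a linkage matrix, R is the dict returned by sch.dendrogram(Y, ...).
    Hypothetical helper, not part of scipy.
    """
    # rd=True also returns a list of nodes indexed by cluster id
    _, nodes = sch.to_tree(Y, rd=True)
    return [(label, sorted(nodes[node_id].pre_order()))
            for label, node_id in zip(R['ivl'], R['leaves'])]


# Stand-in data: 20 random 2-D points instead of the real distance matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
Y = sch.linkage(pdist(X), method='centroid')

# no_plot=True computes the dendrogram layout without drawing it
R = sch.dendrogram(Y, truncate_mode='level', p=2,
                   show_contracted=True, no_plot=True)

branches = members_of_truncated_leaves(Y, R)
for label, members in branches:
    print(label, members)
```

Each printed line pairs a leaf label from the pruned dendrogram with the indices of the original entries in that branch. With the real data, the same helper should work on the Y and Z1 from the question, since fastcluster.linkage is documented to return a scipy-compatible linkage matrix.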

Upvotes: 3
