Jindřich

Reputation: 11240

Coloring specific links in a dendrogram

In a dendrogram from a hierarchical clustering in scipy, I would like to highlight the link connecting two specific labels, let's say 0 and 1.

import scipy.cluster.hierarchy as hac
from matplotlib import pyplot as plt

clustering = hac.linkage(points, method='single', metric='cosine')
link_colors = ["black"] * (2 * len(points) - 1)
hac.dendrogram(clustering, link_color_func=lambda k: link_colors[k])
plt.show()

The clustering has the following format: clustering[i] corresponds to node number len(points) + i and its first two numbers are indices of nodes that are linked. Nodes with indices smaller than len(points) correspond to original points, higher indices to the clusters.

When drawing the dendrogram, different indexing of the links is used and these are the indices that are used for choosing the color. How do the indices of the links (as indexed in link_colors) correspond to indices in clustering?

Upvotes: 3

Views: 1021

Answers (1)

gehbiszumeis

Reputation: 3711

You were very close to the solution. The rows of clustering are sorted by merge distance, i.e. by the third column of the clustering array. The index that link_color_func receives for a link is the row index of that link in clustering plus len(points).

import scipy.cluster.hierarchy as hac
from matplotlib import pyplot as plt
import numpy as np

# Sample data
points = np.array([[8, 7, 7, 1], 
            [8, 4, 7, 0], 
            [4, 0, 6, 4], 
            [2, 4, 6, 3], 
            [3, 7, 8, 5]])

clustering = hac.linkage(points, method='single', metric='cosine')

clustering then looks like this:

array([[3.        , 4.        , 0.00766939, 2.        ],
       [0.        , 1.        , 0.02763245, 2.        ],
       [5.        , 6.        , 0.13433008, 4.        ],
       [2.        , 7.        , 0.15768043, 5.        ]])

As you can see, the row order (and thus the row index) results from clustering being sorted by the third column, the distance at which the two nodes are merged.
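To make the row-to-node mapping concrete, here is a minimal sketch that labels every row of the matrix above with the node id it creates. The matrix is copied by hand from the output above so the snippet runs without scipy; the helper name node_ids is made up for this illustration.

```python
# Linkage matrix copied from the sample output above (5 original points).
clustering = [
    [3.0, 4.0, 0.00766939, 2.0],
    [0.0, 1.0, 0.02763245, 2.0],
    [5.0, 6.0, 0.13433008, 4.0],
    [2.0, 7.0, 0.15768043, 5.0],
]


def node_ids(clustering):
    """Map each new cluster id to the pair of nodes merged in its row.

    Hypothetical helper: row i of a linkage matrix for n points creates
    node n + i, and a linkage matrix has n - 1 rows.
    """
    n_points = len(clustering) + 1
    return {
        n_points + i: (int(row[0]), int(row[1]))
        for i, row in enumerate(clustering)
    }


ids = node_ids(clustering)
for node, children in ids.items():
    print(node, children)
# 5 (3, 4)
# 6 (0, 1)
# 7 (5, 6)
# 8 (2, 7)
```

So the link between points 0 and 1 sits in row 1 and is node 6, which is the index to recolor.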

To highlight a specific link (e.g. [0,1] as you proposed), find the row index of the pair [0,1] within clustering and add len(points). The resulting number is the index into the color list used by link_color_func.

# Initialize the link_colors list with 'black' (as you did already)
link_colors = ['black'] * (2 * len(points) - 1)
# Specify link you want to have highlighted
link_highlight = (0, 1)
# Find the row of clustering whose first two columns equal link_highlight.
# The final [0][0] raises an IndexError if that pair is not merged directly (e.g. (0, 4)).
index_highlight = np.where((clustering[:, 0] == link_highlight[0]) &
                           (clustering[:, 1] == link_highlight[1]))[0][0]
# Index in color_list of desired link is index from clustering + length of points
link_colors[index_highlight + len(points)] = 'red'

hac.dendrogram(clustering, link_color_func=lambda k: link_colors[k])
plt.show()

Like this, you can highlight the desired link: in the resulting dendrogram, the link joining points 0 and 1 is drawn in red and all other links stay black.

It also works for links between an original element and a cluster, or between two clusters (e.g. link_highlight = (5, 6)).
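If you need this more than once, the lookup can be wrapped in a small function. The following is only a sketch: link_color_map is a made-up name, it compares the pair in either order (I would not rely on the smaller index always being in the first column), and it raises a ValueError instead of an IndexError when the two nodes are not merged directly. It assumes that link_color_func is only ever called with cluster ids, i.e. values from len(points) upwards, which is why the dictionary has no entries for the original points.

```python
def link_color_map(clustering, pair, base="black", highlight="red"):
    """Return {node_id: color} with the link joining `pair` highlighted.

    Hypothetical helper. `clustering` is a linkage matrix (a NumPy array
    or a list of rows); row i corresponds to node id n_points + i.
    """
    n_points = len(clustering) + 1  # a linkage matrix has n - 1 rows
    wanted = {int(pair[0]), int(pair[1])}
    colors = {}
    found = False
    for i, row in enumerate(clustering):
        # Compare as sets so (0, 1) and (1, 0) both match.
        is_match = {int(row[0]), int(row[1])} == wanted
        colors[n_points + i] = highlight if is_match else base
        found = found or is_match
    if not found:
        raise ValueError(f"nodes {pair} are not merged directly in this linkage")
    return colors


# Linkage matrix copied from the sample output above (5 original points).
clustering = [
    [3.0, 4.0, 0.00766939, 2.0],
    [0.0, 1.0, 0.02763245, 2.0],
    [5.0, 6.0, 0.13433008, 4.0],
    [2.0, 7.0, 0.15768043, 5.0],
]

colors = link_color_map(clustering, (5, 6))
print(colors[7])  # red

# With the real linkage result, the map plugs into the plot like this:
# hac.dendrogram(clustering, link_color_func=lambda k: colors[k])
```

A dictionary keyed by node id avoids the unused leading entries of the 2 * len(points) - 1 list, but the list from the code above works just as well.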


Upvotes: 3
