ADJ

Reputation: 5282

How to get the cluster history from scikit-learn's Ward?

I'd like to run scikit-learn's hierarchical clustering algorithm (sklearn.cluster.Ward) and get the whole history of how observations have been clustered together, from the first iteration of the algorithm (each observation is its own cluster) to the last iteration (all observations in one cluster). Can scikit-learn do that? The information I'm after is, for each iteration: the cluster/observation being joined to, the cluster/observation being added, and the distance between the two.

Upvotes: 1

Views: 479

Answers (1)

eickenberg

Reputation: 14377

It's all in ward.children_. Each row tells you which 2 features (or previously formed clusters) were merged, thus creating a new cluster. Row i creates the cluster with index n_features + i, so in the end there are 2 * n_features - 1 indices: the original features plus every cluster formed along the way.

import numpy as np
from scipy.ndimage import gaussian_filter1d

# 400 samples on a 20 x 20 grid, smoothed so that neighbouring features correlate
n_samples, n_feat1, n_feat2 = 400, 20, 20
X = np.random.randn(n_samples, n_feat1, n_feat2)
X = gaussian_filter1d(X, sigma=2, axis=1)
X = gaussian_filter1d(X, sigma=2, axis=2)

# Only allow merges between neighbouring grid cells
from sklearn.feature_extraction.image import grid_to_graph
connectivity = grid_to_graph(n_feat1, n_feat2)

from sklearn.cluster import WardAgglomeration
ward = WardAgglomeration(connectivity=connectivity)

# Flatten each sample to a vector of 20 * 20 = 400 features before fitting
ward.fit(X.reshape(n_samples, -1))

print(ward.children_)

array([[ 35,  15],
       [ 36,  16],
       [ 34,  14],
       [181, 180],
       [201, 200],
       [161, 160],
       [241, 240],
       [339, 338],
       [221, 220],
       [24,   4],
       ...])

There are 400 features (indexed by 0-399). The first merge is between features 35 and 15, yielding feature 400. The second merge is between features 36 and 16, yielding feature 401. The third merge is between 34 and 14, yielding 402, and so on.
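The numbering rule described here can be unrolled into an explicit history. The following is a minimal sketch in plain Python, not part of scikit-learn: `merge_history` is a hypothetical helper, and `toy_children` is a made-up 4-leaf example rather than the output shown above. It assumes the input has the same layout as the children array (one row per merge, row i creating cluster n_leaves + i).

```python
def merge_history(children, n_leaves):
    """Return one (left, right, new_id, leaves) tuple per merge, in merge order.

    children: sequence of (left, right) index pairs, one row per merge.
    n_leaves: number of original features/observations (indices 0..n_leaves-1).
    """
    # Each original item starts as a cluster containing only itself.
    members = {i: [i] for i in range(n_leaves)}
    history = []
    for step, (left, right) in enumerate(children):
        new_id = n_leaves + step  # row `step` creates cluster n_leaves + step
        members[new_id] = members[left] + members[right]
        history.append((left, right, new_id, sorted(members[new_id])))
    return history


# Hypothetical toy input with 4 leaves (not taken from the output above).
toy_children = [(1, 0), (3, 2), (4, 5)]
toy_history = merge_history(toy_children, n_leaves=4)
for left, right, new_id, leaves in toy_history:
    print(left, right, "->", new_id, leaves)
```

With a fitted estimator, something like `merge_history(ward.children_.tolist(), n_leaves=400)` should give the same kind of listing for the example above. Note that the children array alone carries no distances.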

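Note that sklearn.cluster.Ward and WardAgglomeration are deprecated and go away in 0.17. Use AgglomerativeClustering (to cluster observations) or FeatureAgglomeration (to cluster features) with linkage="ward" instead; both expose the same children_ attribute.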
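For the distances the question asks about, a sketch with AgglomerativeClustering might look like the following. It assumes scikit-learn 0.24 or later, where the `compute_distances` parameter and the resulting `distances_` attribute are available; the data is random and purely illustrative, and the `history` list layout is my own choice.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.RandomState(0)
X = rng.randn(10, 3)  # 10 observations, 3 variables (made-up data)

# compute_distances=True asks the estimator to keep the merge distances
# in distances_ (assumption: scikit-learn >= 0.24).
model = AgglomerativeClustering(n_clusters=1, linkage="ward",
                                compute_distances=True)
model.fit(X)

n_samples = X.shape[0]
history = []
for step, ((left, right), dist) in enumerate(zip(model.children_,
                                                 model.distances_)):
    # Row `step` merges `left` and `right` into cluster n_samples + step.
    history.append((int(left), int(right), float(dist), n_samples + step))

for left, right, dist, new_id in history:
    print("merge %d + %d -> %d (distance %.3f)" % (left, right, new_id, dist))
```

Here observations rather than features are clustered, which matches the question. Indices below `n_samples` refer to original observations and larger ones to clusters formed in earlier rows.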

Upvotes: 3
