Reputation: 917
I am using SciPy's hierarchical agglomerative clustering methods to cluster a m x n matrix of features, but after the clustering is complete, I can't seem to figure out how to get the centroid from the resulting clusters. Below follows my code:
Y = distance.pdist(features)
Z = hierarchy.linkage(Y, method = "average", metric = "euclidean")
T = hierarchy.fcluster(Z, 100, criterion = "maxclust")
I am taking my matrix of features, computing the euclidean distance between them, and then passing them onto the hierarchical clustering method. From there, I am creating flat clusters, with a maximum of 100 clusters
Now, based on the flat clusters T, how do I get the 1 x n centroid that represents each flat cluster?
Upvotes: 16
Views: 11303
Reputation: 7602
A possible solution is a function, which returns a codebook with the centroids like kmeans
in scipy.cluster.vq
does. Only thing you need is the partition as vector with flat clusters part
and the original observations X
def to_codebook(X, part):
"""
Calculates centroids according to flat cluster assignment
Parameters
----------
X : array, (n, d)
The n original observations with d features
part : array, (n)
Partition vector. p[n]=c is the cluster assigned to observation n
Returns
-------
codebook : array, (k, d)
Returns a k x d codebook with k centroids
"""
codebook = []
for i in range(part.min(), part.max()+1):
codebook.append(X[part == i].mean(0))
return np.vstack(codebook)
Upvotes: 3
Reputation: 2133
You can do something like this (D
=number of dimensions):
# Sum the vectors in each cluster
lens = {} # will contain the lengths for each cluster
centroids = {} # will contain the centroids of each cluster
for idx,clno in enumerate(T):
centroids.setdefault(clno,np.zeros(D))
centroids[clno] += features[idx,:]
lens.setdefault(clno,0)
lens[clno] += 1
# Divide by number of observations in each cluster to get the centroid
for clno in centroids:
centroids[clno] /= float(lens[clno])
This will give you a dictionary with cluster number as the key and the centroid of the specific cluster as the value.
Upvotes: 1