Michael

Reputation: 2556

What are the centroids of k-means clusters with PCA decomposition?

I am applying PCA and then k-means to a dataset, and I would like to know what the central objects in each cluster are.

What is the best way to describe these objects as irises from my original dataset?

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.decomposition import PCA
pca = PCA(n_components=2, whiten=True).fit(X)
X_pca = pca.transform(X)
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3).fit(X_pca)


# I can get the object closest to each center in the reduced space, but this
# does not help me describe the properties of each cluster center in the
# original feature space
from sklearn.metrics import pairwise_distances_argmin_min
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X_pca)
for i in closest:
    print(X_pca[i])

Upvotes: 0

Views: 4261

Answers (2)

jakevdp

Reputation: 86320

There are two ways to do what you ask.

You can get the nearest approximation of the centers in the original feature space using PCA's inverse transform:

centers = pca.inverse_transform(kmeans.cluster_centers_)
print(centers)

[[ 6.82271303  3.13575974  5.47894833  1.91897312]
 [ 5.80425955  2.67855286  4.4229187   1.47741067]
 [ 5.03012829  3.42665848  1.46277424  0.23661913]]

Or, you can recompute the mean in the original space using the original data and the cluster labels:

for label in range(kmeans.n_clusters):
    print(X[kmeans.labels_ == label].mean(0))

[ 6.8372093   3.12093023  5.4627907   1.93953488]
[ 5.80517241  2.67758621  4.43103448  1.45689655]
[ 5.01632653  3.44081633  1.46734694  0.24285714]

Even though the resulting centers are not in the original dataset, you can treat them as if they are! For example, if you're clustering images, the resulting centers can be viewed as images to get insight into the clustering. Alternatively, you can do a nearest-neighbor search on these results to recover the original data point that most closely approximates the center.
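For instance, a minimal sketch of that nearest-neighbor lookup, reusing pca, kmeans, and X from the question (closest_idx is just a name introduced here):

from sklearn.metrics import pairwise_distances_argmin_min

# Map the PCA-space cluster centers back to the original 4-D feature space
centers = pca.inverse_transform(kmeans.cluster_centers_)

# For each center, find the index of the closest actual sample in X
closest_idx, _ = pairwise_distances_argmin_min(centers, X)
for i in closest_idx:
    print(X[i])  # a real iris from the dataset, nearest to that center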

Keep in mind, though, that PCA is lossy and KMeans is fast, and so it's probably going to be more useful to run KMeans on the full, unprojected data:

print(KMeans(3).fit(X).cluster_centers_)

[[ 6.85        3.07368421  5.74210526  2.07105263]
 [ 5.9016129   2.7483871   4.39354839  1.43387097]
 [ 5.006       3.418       1.464       0.244     ]]

In this simple case, all three methods produce very similar results.

Upvotes: 4

Synedraacus

Reputation: 1045

I'm sorry if this is not exactly an answer, but why are you using PCA at all? You are reducing the data from four dimensions to two, which is a one-way operation: you won't get all four parameters back from two, and you may also slightly distort the distance estimates (and therefore the clustering). On the other hand, if you run k-means on the raw data, the cluster centers are described by the same property list as the original items.
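As a quick, self-contained sketch of how much information the reduction discards here, you can inspect the fitted PCA's explained_variance_ratio_ attribute; whatever fraction falls short of 1.0 is what the 4 -> 2 projection throws away:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2, whiten=True).fit(X)

# Fraction of the total variance captured by each retained component
print(pca.explained_variance_ratio_)

# Total variance retained by the two components combined
print(pca.explained_variance_ratio_.sum())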

Upvotes: 0
