Michael

Reputation: 2556

What are the centroids of k-means clusters with PCA decomposition?

I am applying PCA and then k-means to a dataset, and I would like to know what the central objects in each cluster are.

What is the best way to describe these objects as irises from my original dataset?

from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.decomposition import PCA
pca = PCA(n_components=2, whiten=True).fit(X)
X_pca = pca.transform(X)
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3).fit(X_pca)


# I can get the object closest to each center in the reduced space, but this
# does not help me describe the properties of each cluster center in the
# original feature space
from sklearn.metrics import pairwise_distances_argmin_min
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X_pca)
for i in closest:
    print(X_pca[i])

Upvotes: 0

Views: 4261

Answers (2)

jakevdp

Reputation: 86320

There are two ways to do what you ask.

You can get the nearest approximation of the centers in the original feature space using PCA's inverse transform:

centers = pca.inverse_transform(kmeans.cluster_centers_)
print(centers)

[[ 6.82271303  3.13575974  5.47894833  1.91897312]
 [ 5.80425955  2.67855286  4.4229187   1.47741067]
 [ 5.03012829  3.42665848  1.46277424  0.23661913]]

Or, you can recompute the mean in the original space using the original data and the cluster labels:

for label in range(kmeans.n_clusters):
    print(X[kmeans.labels_ == label].mean(0))

[ 6.8372093   3.12093023  5.4627907   1.93953488]
[ 5.80517241  2.67758621  4.43103448  1.45689655]
[ 5.01632653  3.44081633  1.46734694  0.24285714]

Even though the resulting centers are not in the original dataset, you can treat them as if they are! For example, if you're clustering images, the resulting centers can be viewed as images to get insight into the clustering. Alternatively, you can do a nearest-neighbor search on these results to recover the original data point that most closely approximates the center.
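For instance, a minimal sketch of that nearest-neighbor lookup, reusing pca, kmeans, and X from the question (closest_idx is just a name introduced here):

from sklearn.metrics import pairwise_distances_argmin_min

# Map the PCA-space cluster centers back to the original 4-D feature space
centers = pca.inverse_transform(kmeans.cluster_centers_)

# For each center, find the index of the closest actual sample in X
closest_idx, _ = pairwise_distances_argmin_min(centers, X)
for i in closest_idx:
    print(X[i])  # a real iris from the dataset, nearest to that center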

Keep in mind, though, that PCA is lossy and KMeans is fast, and so it's probably going to be more useful to run KMeans on the full, unprojected data:

print(KMeans(3).fit(X).cluster_centers_)

[[ 6.85        3.07368421  5.74210526  2.07105263]
 [ 5.9016129   2.7483871   4.39354839  1.43387097]
 [ 5.006       3.418       1.464       0.244     ]]

In this simple case, all three methods produce very similar results.

Upvotes: 4

Synedraacus

Reputation: 1045

I'm sorry if this is not exactly an answer, but why are you using PCA at all? You are reducing the data from four dimensions to two, which is a one-way operation: you won't get all four parameters back from two, and you may also slightly distort the distance estimates (and therefore the clustering). On the other hand, if you run k-means on the raw data, the cluster centers are described by the same property list as the original items.
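As a quick, self-contained sketch of how much information the reduction discards here, you can inspect the fitted PCA's explained_variance_ratio_ attribute; whatever fraction falls short of 1.0 is what the 4 -> 2 projection throws away:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2, whiten=True).fit(X)

# Fraction of the total variance captured by each retained component
print(pca.explained_variance_ratio_)

# Total variance retained by the two components combined
print(pca.explained_variance_ratio_.sum())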

Upvotes: 0
