jacky_learns_to_code

Reputation: 824

K-means clustering using sklearn.cluster

I came across a tutorial on K-means clustering, "Unsupervised Machine Learning: Flat Clustering", and its code is below:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")

from sklearn.cluster import KMeans

X = np.array([[1,2],[5,8],[1.5,1.8],[1,0.6],[9,11]])

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

centroid = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroid)
print(labels)

colors = ["g.", "r.", "c."]

# Plot each point in the color of its assigned cluster
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

# Mark the cluster centers with an x
plt.scatter(centroid[:, 0], centroid[:, 1], marker="x", s=150, linewidths=5, zorder=10)

plt.show()

In this example, each sample in the array has only 2 features: [1,2], [5,8], [1.5,1.8], etc.

I have tried replacing X with a 10 x 750 matrix (750 features) stored in an np.array(), but the resulting graph does not make any sense.

How can I alter the above code to solve my problem?

Upvotes: 3

Views: 12301

Answers (2)

meelo

Reputation: 582

Practically speaking, it is impossible to visualize 750-dimensional data directly.

But there are ways around this, for example reducing the dimensionality first with PCA to a fairly low number of dimensions, such as 4. Scikit-learn provides this as sklearn.decomposition.PCA.

Then you can draw a matrix of plots, with each plot showing only two features. With the pandas package, you can draw these plots very easily using the scatter_matrix function.

Note that in your case you are using PCA only for visualization. You should still run K-means clustering on the original data; after getting the centroids, transform them with the same PCA model you fitted earlier.
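
As a rough sketch of that workflow (not the exact code behind the plot in this answer): the random 10 x 750 matrix stands in for the real data, and the helper names cluster_and_project and plot_projection, the seed, and n_clusters=3 are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def cluster_and_project(X, n_clusters=3, n_components=4, seed=0):
    """Run K-means on the original features, then project the points
    and the centroids with one and the same fitted PCA model."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    pca = PCA(n_components=n_components).fit(X)
    X_reduced = pca.transform(X)
    centroids_reduced = pca.transform(kmeans.cluster_centers_)
    return kmeans.labels_, X_reduced, centroids_reduced


def plot_projection(X_reduced, labels):
    """Draw pairwise scatter plots of the principal components,
    colored by cluster label."""
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    columns = [f"PC{i + 1}" for i in range(X_reduced.shape[1])]
    df = pd.DataFrame(X_reduced, columns=columns)
    scatter_matrix(df, c=labels, diagonal="hist", figsize=(8, 8))
    plt.show()


# Placeholder data (assumption): 10 samples with 750 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 750))

labels, X_reduced, centroids_reduced = cluster_and_project(X)
```

Calling plot_projection(X_reduced, labels) then draws the plot matrix. Note that n_components cannot exceed the number of samples, so with only 10 rows the upper limit is 10.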

Here is an example plot created by the scatter_matrix function: [image not available]

Upvotes: 3

Has QUIT--Anony-Mousse

Reputation: 77485

Visualizing 750 dimensions is hard.

Figure out how to visualize your data independently of k-means.

But don't expect k-means to return meaningful results on such data. It is very sensitive to preprocessing and normalization, and most likely your 750 dimensions are not all on the same continuous numerical scale.
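
One common way to deal with differing scales is to standardize each feature before clustering. The sketch below is only one possible choice of preprocessing, not necessarily the right one for your data; the synthetic matrix and its per-feature scales are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 10 samples, 750 features on very different scales
scales = rng.uniform(0.1, 1000.0, size=750)
X = rng.normal(size=(10, 750)) * scales

# Rescale every feature to zero mean and unit variance so that no single
# large-valued feature dominates the Euclidean distances used by k-means
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_
```

Whether standardization is appropriate depends on what the features mean; if they are not continuous numeric quantities at all, rescaling will not make k-means meaningful.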

Upvotes: 0
