intelligent

Reputation: 57

Get cluster points after KMeans in a list format

Suppose I clustered a data set using sklearn's K-means.

I can see the centroids easily using KMeans.cluster_centers_, but I also need the points that belong to each cluster, in a list, the same way I can get the centroids.

How can I do that?

Upvotes: 3

Views: 7057

Answers (3)

premvardhan

Reputation: 81

This question was asked a long time ago, so you probably already have an answer, but I'll post this in case it helps someone else. Scikit-learn's KMeans has an attribute called cluster_centers_, an array of shape (n_clusters, n_features) that holds the coordinates of each centroid. The cluster assignment of each point is stored in labels_, which the scatter plot below uses for coloring. The simple code below inspects the cluster centers; please go through the comments in the code.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Iris data
iris = datasets.load_iris()
X = iris.data
# Standardization
std_data = StandardScaler().fit_transform(X)

# KMeans clustering with 3 clusters
clf =  KMeans(n_clusters = 3)
clf.fit(std_data)

# Coordinates of cluster centers with shape [n_clusters, n_features]
# Here: 3 clusters with 4 features each
print("Shape of cluster centers:", clf.cluster_centers_.shape)

# Scatter plot to see each cluster points visually 
plt.scatter(std_data[:,0], std_data[:,1], c = clf.labels_, cmap = "rainbow")
plt.title("K-means Clustering of iris data flower")
plt.show()

# Put the ndarray of cluster centers into a pandas DataFrame
coef_df = pd.DataFrame(clf.cluster_centers_, columns = ["Sepal length", "Sepal width", "Petal length", "Petal width"])
print("\nDataFrame containing each cluster center with feature names:\n", coef_df)

# Convert the ndarray of cluster centers to a nested list
ndarray2list = clf.cluster_centers_.tolist()
print("\nList of cluster centers:\n")
print(ndarray2list)


Upvotes: -1

seralouk

Reputation: 33137

You need to do the following (see comments in my code):

import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(0)

# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X)  # KMeans is unsupervised, so the target y is not needed

# Coordinates of cluster centers with shape [n_clusters, n_features]
clf.cluster_centers_

# Label (cluster index) of each point
clf.labels_

# !! Get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}

# Transform the dictionary into a list of [cluster, indices] pairs
# (dict.iteritems() exists only in Python 2; use items() in Python 3)
dictlist = []
for key, value in mydict.items():
    temp = [key, value]
    dictlist.append(temp)

RESULTS

{0: array([ 50,  51,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
            64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,
            78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
            91,  92,  93,  94,  95,  96,  97,  98,  99, 101, 106, 113, 114,
           119, 121, 123, 126, 127, 133, 138, 142, 146, 149]),
 1: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
           17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
           34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
 2: array([ 52,  77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
           115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
           134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])}


[[0, array([ 50,  51,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
             64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,
             78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
             91,  92,  93,  94,  95,  96,  97,  98,  99, 101, 106, 113, 114,
             119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
 [1, array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
            34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
 [2, array([ 52,  77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
             115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
             134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]
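The results above hold row indices, not the points themselves. If the goal is the cluster points in list format, a minimal sketch along the same lines could index X with those indices. The names km, indices_by_cluster and points_by_cluster are made up for this illustration, and random_state and n_init are fixed only so that the run is repeatable:

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

# random_state and n_init are set here only to make the run repeatable
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Row indices of the samples assigned to each cluster (same idea as mydict)
indices_by_cluster = {i: np.where(km.labels_ == i)[0] for i in range(km.n_clusters)}

# The points themselves: one nested list of feature vectors per cluster
points_by_cluster = [X[idx].tolist() for idx in indices_by_cluster.values()]

for i, pts in enumerate(points_by_cluster):
    print("cluster", i, "has", len(pts), "points; first point:", pts[0])
```

Each entry of points_by_cluster is a plain Python list of 4-element lists, so it can be handled the same way as cluster_centers_.tolist().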

Upvotes: 2

Jan K

Reputation: 4150

You are probably looking for the labels_ attribute.
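As a rough sketch of what that looks like in practice (using the Iris data as an assumed example, with illustrative variable names): labels_ holds one cluster index per sample, so a boolean mask on it selects the rows of a given cluster.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

# random_state and n_init are set here only to make the run repeatable
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = km.labels_  # shape (n_samples,), values in 0..n_clusters-1

# Number of points assigned to each cluster
sizes = np.bincount(labels, minlength=km.n_clusters)

# Boolean mask: the rows of X that were assigned to cluster 0
cluster_0 = X[labels == 0]

print(labels.shape, sizes, cluster_0.shape)
```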

Upvotes: 3
