LJM

Reputation: 57

predict method on sklearn kmeans, how does it work and what is it doing?

I have been playing around with sklearn's k-means clustering class and I am confused about its predict method.

I have applied a model on the iris dataset like so:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

pca = PCA(n_components=2).fit(X_train)

X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

kmeans_pca = KMeans(n_clusters=3).fit(X_train_pca)

And have made predictions:

pred = kmeans_pca.predict(X_test_pca)

print(classification_report(y_test, pred))

          precision    recall  f1-score   support

       0       1.00      1.00      1.00        19
       1       0.76      0.87      0.81        15
       2       0.86      0.75      0.80        16

    accuracy                           0.88        50
   macro avg       0.87      0.87      0.87        50
weighted avg       0.88      0.88      0.88        50

The predictions seem adequate, which has confused me, as I did not pass any labels in with the training set. I have read this post What is the use of predict() method in kmeans implementation of scikit learn?, which tells me that the predict method assigns each test sample to the closest cluster centroid. However, I don't know how sklearn correctly assigns the cluster IDs during the training stage (i.e. how kmeans_pca.labels_ lines up with y_train) in the first place, as the training stage does not involve labels.

I realise that k-means is not used for classification tasks, but I would like to know how these results were achieved. With this, what purpose could .predict() serve when performing k-means clustering in sklearn?

Upvotes: 2

Views: 2572

Answers (3)

DeepEspresso

Reputation: 31

KMeans clustering is an example of unsupervised learning. This means that, indeed, it does not take into account any labels for training.

Instead, examples are clustered entirely from patterns among the features: similar examples are grouped together. In the case of the Iris dataset, different examples of the same flower species will tend to have similar lengths and widths of sepals and petals (i.e. the 'features' of the flower). These features alone give away how to group the flowers, without any need to provide explicit labels.

To understand how the results are achieved, it might be helpful to understand the algorithm. The most common KMeans algorithm (Lloyd's algorithm) is based on the following steps:

  1. Initialize K different cluster centroids (possibly randomly, but not necessarily)
  2. Assign each example to the nearest cluster (e.g. based on Euclidean distance between feature vector and cluster centroids)
  3. Recalculate cluster centroids from cluster members found in step 2.

Steps 2 and 3 are repeated until convergence (i.e. when cluster assignments no longer change).
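As a rough illustration, here is a minimal NumPy sketch of those three steps (this is not sklearn's actual implementation, which uses smarter 'k-means++' initialization and several optimizations, and the empty-cluster edge case is ignored for brevity):

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids, here by picking k random training points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each example to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned examples
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels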

The above algorithm ultimately assigns similar examples to the same clusters and, hence, only cares about similarities between the features and not their labels.

The .predict() method will give you the most likely cluster assignment for any test examples (e.g. 'flowers', as above). Indeed, this is done by assigning each example to the closest cluster centroid learned above.
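In terms of the sketch above, predicting is just the assignment step applied to new data, with the centroids held fixed (again an illustration, not sklearn's code):

import numpy as np

def predict_cluster(X_new, centroids):
    # Assign each new example to the cluster whose centroid is closest
    distances = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)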

Upvotes: 3

Aditya

Reputation: 1135

The KMeans clustering code assigns each data point to one of the K clusters that you specified when fitting the model. The integer cluster ids it hands out are arbitrary, so different runs can attach different ids to the same clusters, although within any single run all points belonging to one cluster share the same id.

E.g., for this example, suppose the cluster ids (labels) assigned to your data were [1 1 0 0 2 2 2] for K=3; in the next run, they could have been [0 0 2 2 1 1 1]. Note that the cluster ids have changed, even though points belonging to the same cluster still share a common id.
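You can see this for yourself by fitting twice with different seeds; the ids may or may not come out permuted for any particular pair of seeds, but the grouping itself stays essentially the same:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Two fits that differ only in the random seed: the clusters found are
# essentially the same, but the integer ids attached to them can permute
labels_a = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X).labels_
labels_b = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X).labels_
print(labels_a[:10])
print(labels_b[:10])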

In your case, the ids learned during training happened to line up with the true labels, although with 3 clusters there are 3! = 6 possible ways the cluster ids could have been allocated, and this was just one of them.

This was my output from running the prediction with a KMeans model trained on the Iris data.

print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        19
           1       0.00      0.00      0.00        15
           2       0.92      0.75      0.83        16

    accuracy                           0.24        50
   macro avg       0.31      0.25      0.28        50
weighted avg       0.30      0.24      0.26        50

As you can see, only the points belonging to cluster-id 2 were given an id matching their true label; the ids learned for the other two clusters did not line up with the true encoding, which drags the overall accuracy down.
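If you do want to score a clustering against known labels, one common workaround (sketched here with scipy's Hungarian-algorithm solver; the helper name is my own) is to first find the id permutation that best matches the true labels:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def align_cluster_ids(y_true, y_pred):
    # Find the permutation of cluster ids that maximises agreement with
    # the true labels, then relabel the predictions accordingly
    cm = confusion_matrix(y_true, y_pred)
    true_ids, cluster_ids = linear_sum_assignment(-cm)
    mapping = dict(zip(cluster_ids, true_ids))
    return np.array([mapping[c] for c in y_pred])

Alternatively, permutation-invariant metrics such as sklearn.metrics.adjusted_rand_score sidestep the id-matching problem entirely.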

Upvotes: 1

Ghassen Sultana

Reputation: 1402

Clustering is an unsupervised learning algorithm, which means that it does not need labels to train.

When you specify KMeans(n_clusters=3), the model will try to create 3 clusters.

In this case, the clustering algorithm will find 3 centroids that maximise the inter-cluster distance and minimise the intra-cluster distance.

The cluster ids are attributed arbitrarily, so if you run the same algorithm on the same 4 points without fixing the seed, you can get differently-labelled results, e.g. Run 1: [0,0,1,2], Run 2: [1,1,0,2], Run 3: [2,2,0,1], and so on.

So once the model is trained, we can predict (even if the term 'prediction' is not quite adequate here), which consists of giving each row the label of its closest centroid.
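Using the variables from the question, you can check this nearest-centroid behaviour against the fitted centroids stored in cluster_centers_ (a quick sketch, which should agree with predict barring exact distance ties):

import numpy as np

# predict() should match a manual nearest-centroid assignment
centers = kmeans_pca.cluster_centers_
distances = np.linalg.norm(X_test_pca[:, None, :] - centers[None, :, :], axis=2)
manual_labels = distances.argmin(axis=1)
assert (manual_labels == kmeans_pca.predict(X_test_pca)).all()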

Upvotes: 0
