ginopino

Reputation: 125

How to find the k value for K-Means clustering using scikit-learn in Python

I have a Pandas DataFrame which looks like this:

          1         2         3         4         5         6         7         8         9         10        11   ...       467       468       469       470       471       472       473       474       475       476       477
1    1.000000  0.014085  0.134615  0.053030  0.109756  0.092105  0.095238  0.058824  0.104167  0.043478  0.135135  ...  0.045752  0.084112  0.098039  0.060870  0.000000  0.127273  0.043716  0.084615  0.068323  0.122449  0.172414
2    0.014085  1.000000  0.026667  0.039735  0.038095  0.074468  0.134021  0.084337  0.092593  0.184211  0.030303  ...  0.092025  0.107438  0.120690  0.021898  0.098361  0.176471  0.105820  0.127660  0.085714  0.132743  0.100000
3    0.134615  0.026667  1.000000  0.058824  0.054945  0.011494  0.089888  0.040541  0.078947  0.040541  0.141026  ...  0.050955  0.052174  0.063636  0.016000  0.000000  0.098361  0.048128  0.057971  0.072727  0.074766  0.068182
4    0.053030  0.039735  0.058824  1.000000  0.113924  0.056604  0.059880  0.039735  0.094170  0.039735  0.076433  ...  0.113636  0.104396  0.070652  0.072539  0.015152  0.042553  0.108434  0.081340  0.070833  0.059783  0.083333
5    0.109756  0.038095  0.054945  0.113924  1.000000  0.237113  0.102564  0.048077  0.120000  0.101010  0.090090  ...  0.064865  0.077465  0.111940  0.152174  0.011765  0.076087  0.070423  0.126582  0.082902  0.139535  0.145695
..        ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
473  0.043716  0.105820  0.048128  0.108434  0.070423  0.062802  0.150754  0.123656  0.333333  0.100000  0.110553  ...  0.178571  0.258706  0.257576  0.074689  0.033333  0.137143  1.000000  0.235556  0.202335  0.187500  0.147059
474  0.084615  0.127660  0.057971  0.081340  0.126582  0.118421  0.209459  0.074324  0.294737  0.135714  0.285714  ...  0.165094  0.215569  0.326667  0.071795  0.054264  0.221311  0.235556  1.000000  0.269608  0.287582  0.225275
475  0.068323  0.085714  0.072727  0.070833  0.082902  0.069149  0.337580  0.117647  0.259091  0.172840  0.286624  ...  0.139344  0.187817  0.204188  0.048035  0.050314  0.111111  0.202335  0.269608  1.000000  0.280899  0.175926
476  0.122449  0.132743  0.074766  0.059783  0.139535  0.130081  0.345455  0.132743  0.343750  0.361702  0.166667  ...  0.173913  0.312977  0.302326  0.059524  0.071429  0.229167  0.187500  0.287582  0.280899  1.000000  0.246753
477  0.172414  0.100000  0.068182  0.083333  0.145695  0.122449  0.200000  0.076923  0.248705  0.157895  0.144828  ...  0.157895  0.222222  0.220126  0.133333  0.065041  0.142857  0.147059  0.225275  0.175926  0.246753  1.000000

This is just an example; the number of rows and columns may vary for other DataFrames. I need to cluster the values using K-Means in scikit-learn, but I have no idea how to find the correct number of clusters for my DataFrame. Any suggestions? Also, as I am new to Python and this is the first time I am using scikit-learn, an easy explanation of how to perform K-Means clustering would be much appreciated.

Upvotes: 0

Views: 1157

Answers (2)

gurezende

Reputation: 206

A very common approach is the Elbow method (https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/): fit a KMeans model for each candidate number of clusters and plot each model's .inertia_ attribute against that number of clusters.

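# Imports needed by the snippet; df is the DataFrame to cluster
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit one model per k from 2 to 14 and record its inertia
k_values = range(2, 15)
inertia = []
for k in k_values:
    model = KMeans(n_clusters=k, random_state=12).fit(df)
    inertia.append(model.inertia_)

# Plot inertia against the number of clusters
plt.plot(k_values, inertia, 'o-')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()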

You should see something similar to the plot below. For the specific data frame I created for this exercise, the ideal k would be 3: that is where the "elbow" appears, and larger k values bring little further reduction in inertia.

Image: Elbow method
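
Once a k has been read off the elbow plot, the clustering itself is one more fit. The following is only a minimal sketch under assumptions that are not in the answer: the data is a synthetic stand-in generated with make_blobs, the column names f0..f3 are made up, and k = 3 is simply taken as the chosen value. With a real DataFrame, df would replace the synthetic one.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real DataFrame (assumption, not the asker's data).
X, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=12)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(4)])

# k = 3 is assumed to be the value chosen from the elbow plot.
model = KMeans(n_clusters=3, n_init=10, random_state=12).fit(df)

# labels_ holds one cluster index per row, in the same order as df.
clustered = df.assign(cluster=model.labels_)

# Show how many rows ended up in each cluster.
print(clustered["cluster"].value_counts())
```

The cluster column can then be used to group or filter rows; model.cluster_centers_ holds the coordinates of the fitted centroids.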

Upvotes: 1

Saptarshi

Reputation: 148

The Elbow method is the usual way to find the value of k in K-Means.

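import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is the data to cluster; K was left undefined, so range(1, 10) is an assumed choice
K = range(1, 10)
inertias = []
for k in K:
    clf = KMeans(n_clusters=k)
    clf.fit(X)
    inertias.append(clf.inertia_)

# Plot against K so the x-axis shows k rather than the list index
plt.plot(K, inertias)
plt.show()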

Image from towardsdatacience

Now find the breakpoint in the plot. In the provided image, the inertia drops sharply from point 1 to point 3, and the rate of change flattens from point 4 onwards. That makes 4 the elbow point, i.e. k=4.
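
Reading the elbow by eye can be ambiguous, so the "rate of change reduces" idea can also be approximated numerically. The sketch below is not part of the answer: it uses a simple second-difference heuristic, the function name elbow_by_second_difference is hypothetical, and the inertia values are made up for illustration. Treat the result as a hint to check against the plot rather than a definitive k.

```python
def elbow_by_second_difference(k_values, inertias):
    """Return the k where the inertia curve bends most sharply.

    Heuristic (an assumption, not a scikit-learn feature): pick the interior
    point with the largest second difference
    inertias[i-1] - 2 * inertias[i] + inertias[i+1].
    """
    if len(k_values) != len(inertias) or len(inertias) < 3:
        raise ValueError("need at least three (k, inertia) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k


# Made-up inertia values for k = 1..7, for illustration only.
ks = [1, 2, 3, 4, 5, 6, 7]
vals = [1000.0, 600.0, 300.0, 120.0, 100.0, 90.0, 85.0]
print(elbow_by_second_difference(ks, vals))  # prints 4
```

With real data, ks and vals would be the K range and the inertias list collected in the loop above.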

For a detailed explanation, you may visit:

  1. Elbow Method for K Means
  2. GFG

Upvotes: 0
