Reputation: 125
I have a Pandas DataFrame which looks like this:
1 2 3 4 5 6 7 8 9 10 11 ... 467 468 469 470 471 472 473 474 475 476 477
1 1.000000 0.014085 0.134615 0.053030 0.109756 0.092105 0.095238 0.058824 0.104167 0.043478 0.135135 ... 0.045752 0.084112 0.098039 0.060870 0.000000 0.127273 0.043716 0.084615 0.068323 0.122449 0.172414
2 0.014085 1.000000 0.026667 0.039735 0.038095 0.074468 0.134021 0.084337 0.092593 0.184211 0.030303 ... 0.092025 0.107438 0.120690 0.021898 0.098361 0.176471 0.105820 0.127660 0.085714 0.132743 0.100000
3 0.134615 0.026667 1.000000 0.058824 0.054945 0.011494 0.089888 0.040541 0.078947 0.040541 0.141026 ... 0.050955 0.052174 0.063636 0.016000 0.000000 0.098361 0.048128 0.057971 0.072727 0.074766 0.068182
4 0.053030 0.039735 0.058824 1.000000 0.113924 0.056604 0.059880 0.039735 0.094170 0.039735 0.076433 ... 0.113636 0.104396 0.070652 0.072539 0.015152 0.042553 0.108434 0.081340 0.070833 0.059783 0.083333
5 0.109756 0.038095 0.054945 0.113924 1.000000 0.237113 0.102564 0.048077 0.120000 0.101010 0.090090 ... 0.064865 0.077465 0.111940 0.152174 0.011765 0.076087 0.070423 0.126582 0.082902 0.139535 0.145695
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
473 0.043716 0.105820 0.048128 0.108434 0.070423 0.062802 0.150754 0.123656 0.333333 0.100000 0.110553 ... 0.178571 0.258706 0.257576 0.074689 0.033333 0.137143 1.000000 0.235556 0.202335 0.187500 0.147059
474 0.084615 0.127660 0.057971 0.081340 0.126582 0.118421 0.209459 0.074324 0.294737 0.135714 0.285714 ... 0.165094 0.215569 0.326667 0.071795 0.054264 0.221311 0.235556 1.000000 0.269608 0.287582 0.225275
475 0.068323 0.085714 0.072727 0.070833 0.082902 0.069149 0.337580 0.117647 0.259091 0.172840 0.286624 ... 0.139344 0.187817 0.204188 0.048035 0.050314 0.111111 0.202335 0.269608 1.000000 0.280899 0.175926
476 0.122449 0.132743 0.074766 0.059783 0.139535 0.130081 0.345455 0.132743 0.343750 0.361702 0.166667 ... 0.173913 0.312977 0.302326 0.059524 0.071429 0.229167 0.187500 0.287582 0.280899 1.000000 0.246753
477 0.172414 0.100000 0.068182 0.083333 0.145695 0.122449 0.200000 0.076923 0.248705 0.157895 0.144828 ... 0.157895 0.222222 0.220126 0.133333 0.065041 0.142857 0.147059 0.225275 0.175926 0.246753 1.000000
This is an example, but the number of rows and columns may vary for other DataFrames. I need to cluster the values using K-Means in scikit-learn, but I have no idea how to find the correct number of clusters for my DataFrame. Any suggestions? Also, as I am new to Python and this is the first time I have used scikit-learn, an easy explanation of how to perform K-Means clustering would be much appreciated.
Upvotes: 0
Views: 1157
Reputation: 206
A very common approach is the Elbow method (https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/): you fit a KMeans model to your data for each candidate number of clusters, read the .inertia_ attribute of each fitted model, and plot inertia against the number of clusters.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit one model per k from 2 to 14 and record its inertia
k_values = range(2, 15)
inertia = []
for k in k_values:
    model = KMeans(n_clusters=k, random_state=12).fit(df)
    inertia.append(model.inertia_)

# Plot inertia against the number of clusters
plt.plot(k_values, inertia, 'o-')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
You should see something similar to this. For the specific data frame I created for this exercise, the ideal k would be 3: that is where you see the "elbow", and larger k values do not bring much further change in inertia.
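Once you have picked k from the plot, the clustering itself is one more fit. Below is a minimal sketch, not a definitive recipe: the small random similarity matrix is only a hypothetical stand-in for your DataFrame, and k = 3 is an assumed value that you would replace with whatever your own elbow plot suggests.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the question's DataFrame: a small symmetric
# matrix with ones on the diagonal and labels starting at 1.
rng = np.random.default_rng(0)
a = rng.random((30, 30))
sim = (a + a.T) / 2
np.fill_diagonal(sim, 1.0)
df = pd.DataFrame(sim, index=range(1, 31), columns=range(1, 31))

k = 3  # assumed value; read it off your own elbow plot

# Each row of the DataFrame is treated as one sample to be clustered.
model = KMeans(n_clusters=k, n_init=10, random_state=12)
labels = model.fit_predict(df.to_numpy())

# Attach the cluster label to the original row index for easy lookup.
clusters = pd.Series(labels, index=df.index, name="cluster")
print(clusters.value_counts().sort_index())
```

The resulting clusters Series maps each row label of the DataFrame to a cluster id between 0 and k - 1, so you can look up or group rows by cluster afterwards.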
Upvotes: 1
Reputation: 148
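We usually use the Elbow Method to find the value of "K" in K-means.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

K = range(1, 10)  # candidate k values (an assumed range; adjust to your data)
inertias = []
for k in K:
    clf = KMeans(n_clusters=k)
    clf.fit(X)  # X is your feature matrix, e.g. the DataFrame
    inertias.append(clf.inertia_)
plt.plot(K, inertias)  # x-axis is k itself, so the elbow can be read off directly
plt.show()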
Now find the breakpoint in the plot. In the provided image, the inertia drops sharply from point 1 to point 3, and the rate of change slows from point 4 onward. That makes 4 the elbow point, i.e., k=4.
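If the breakpoint is hard to judge by eye, you can also print how much the inertia drops at each step. This is only a rough sketch under stated assumptions: the data comes from make_blobs with four centers purely for illustration, and the range of k values is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data only: 300 points around 4 centers (an assumption).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = list(range(1, 10))  # assumed range of candidate k values
inertias = []
for k in ks:
    clf = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(clf.inertia_)

# Drop in inertia when moving from k to k + 1.
drops = -np.diff(inertias)
for k, drop in zip(ks[:-1], drops):
    print(f"k={k} -> k={k + 1}: inertia drops by {drop:.1f}")
```

The elbow is roughly where these drops stop being large and start to level off; the numbers support the visual reading of the plot rather than replace it.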
For a detailed explanation, you may visit,
Upvotes: 0