pythonLover

Reputation: 5

Elbow Method for kmeans

I am working on a clustering task and used the Elbow Method to find the optimal number of clusters (k), but the resulting plot is almost linear and I am not able to determine k from it.

Thank you


Upvotes: 0

Views: 3405

Answers (2)

ASH

Reputation: 20302

There are many ways to do this kind of thing. For one, you can use Yellowbrick to do the work for you.


import pandas as pd
import matplotlib as mpl 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn import datasets

from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

mpl.rcParams["figure.figsize"] = (9,6)

# Load iris flower dataset
iris = datasets.load_iris()

X = iris.data  # Clustering is unsupervised learning, so we only need the features X (iris.data), not the labels y (iris.target)
# Converting the data into dataframe
feature_names = iris.feature_names
iris_dataframe = pd.DataFrame(X, columns=feature_names)
iris_dataframe.head(10)

# Fit a KMeans model with 3 clusters (we already know the Iris dataset has 3 classes)
k_means = KMeans(n_clusters=3)
k_means.fit(X)

# Plotting a 3d plot using matplotlib to visualize the data points
fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111, projection='3d')

# Setting the colors to match cluster results
colors = ['red' if label == 0 else 'purple' if label==1 else 'green' for label in k_means.labels_]

ax.scatter(X[:,3], X[:,0], X[:,2], c=colors)

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,11))

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.show()    # Render the elbow plot
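
As a side note, the snippet above imports SilhouetteVisualizer without using it. Here is a minimal sketch of how it could be applied on top of the same data (the choice of 3 clusters simply mirrors the fit above, and Yellowbrick's KElbowVisualizer also accepts a metric argument, e.g. metric='silhouette', if you prefer that criterion in the elbow plot):

# Sketch: silhouette plot for a 3-cluster model, reusing X and the imports above
sil_model = KMeans(n_clusters=3)
sil_visualizer = SilhouetteVisualizer(sil_model, colors='yellowbrick')
sil_visualizer.fit(X)     # Fit the data and compute per-sample silhouette coefficients
sil_visualizer.show()     # Render the silhouette plot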


Please see the links below for more info.

https://notebook.community/DistrictDataLabs/yellowbrick/examples/gokriznastic/Iris%20-%20clustering%20example

https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20-%20Historical%20Stock%20Prices.ipynb

Upvotes: 0

Oren Matar

Reputation: 750

I recommend using the silhouette score to determine the number of clusters. It doesn't require looking at a plot and can be fully automated: just try different k values and select the one with the maximum silhouette score:

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
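
A minimal sketch of that idea with scikit-learn's silhouette_score (the Iris data, the k range 2-10, and the random_state are just placeholder choices for your own data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data  # stand-in for your own feature matrix

# Compute the silhouette score for each candidate k and keep the best one
candidate_ks = range(2, 11)
scores = []
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels))

best_k = candidate_ks[int(np.argmax(scores))]  # highest silhouette score wins
print(best_k, max(scores))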

However, it doesn't look like this will solve your problem in this specific case. If the data points are distributed pretty evenly over the space, meaning they don't really form any clusters, there will be no best k value. Check out the last row here as an example:

https://scikit-learn.org/stable/modules/clustering.html

k-means will technically still create different clusters, but they are not really separated from one another the way you would want clusters to be. In such cases there will be no clear maximum silhouette score, and the elbow method won't work either. That's probably what happened in your case: there are no true clusters in the data...
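
As an illustrative sketch of that point (the uniform and blob datasets below are made up purely for the comparison), silhouette scores stay low and flat on structureless data but peak clearly at the true k on well-separated blobs:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
uniform = rng.uniform(size=(300, 2))                              # no cluster structure
blobs, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # 3 real clusters

for name, data in [("uniform", uniform), ("blobs", blobs)]:
    scores = []
    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
        scores.append((k, round(silhouette_score(data, labels), 2)))
    print(name, scores)
# Expect a flat, unremarkable score profile for the uniform data and a
# clear peak at k=3 for the blobs.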

Upvotes: 1
