Test

Reputation: 550

K-Means algorithm: centroids are not placed in the clusters

I have a problem. I want to cluster my dataset, but my centroids end up outside the clusters instead of inside them. I have already read the question "Python k-mean, centroids are placed outside of the clusters" on this topic.

However, I do not know what could be the reason. How can I cluster correctly?

You can find the dataset at https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset

import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster import hierarchy
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import silhouette_samples
import matplotlib as mpl
import matplotlib.pyplot as plt

df = pd.read_csv(r'https://gist.githubusercontent.com/Coderanker3/24c948d2ff0b7f71e51b3774c2cc7b22/raw/253ba0660720de3a9cf7dee2a2d25a37f61095ca/Dataset')
df.shape

features_clustering = ['review_scores_accuracy',
 'distance_to_center',
 'bedrooms',
 'review_scores_location',
 'review_scores_value',
 'number_of_reviews',
 'beds',
 'review_scores_communication',
 'accommodates',
 'review_scores_checkin',
 'amenities_count',
 'review_scores_rating',
 'reviews_per_month',
 'corrected_price']

df_cluster = df[features_clustering].copy()
X = df_cluster.copy()

model = KMeans(n_clusters=4, random_state=53, n_init=10, max_iter=1000, tol=0.0001)
clusters = model.fit_predict(X)
df_cluster["cluster"] = clusters

fig = plt.figure(figsize=(8, 8))
sns.scatterplot(data=df_cluster, x="amenities_count", y="corrected_price", hue="cluster", palette='Set2_r')
sns.scatterplot(x=model.cluster_centers_[:,0], y=model.cluster_centers_[:,1], color='blue',marker='*',
                            label='centroid', s=250)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
#plt.ylim(ymin=0)
plt.xlim(xmin=-0.1)
plt.show()

model.cluster_centers_


inertia = model.inertia_
sil = metrics.silhouette_score(X,model.labels_)

print(f'inertia {inertia:.3f}')
print(f'silhouette {sil:.3f}')

[OUT]

inertia 4490.076
silhouette 0.156

Upvotes: 0

Views: 1500

Answers (2)

Claudio

Reputation: 84

You are building multidimensional clusters and trying to fit them onto a two-dimensional plot; by itself that will not work. Let me explain. Each variable is a dimension: x1, x2, x3, ..., xn, and each centroid you get back has one coordinate per variable: y1, y2, y3, ..., yn. In your 2D plot you picked two of those variables for the data (in your feature list, x11 is "amenities_count" and x14 is "corrected_price").

The plot therefore shows only these two variables for the data, while for the centroids your code takes the first two coordinates, y1 and y2 (cluster_centers_[:,0] and cluster_centers_[:,1]). Note that the x variables you plotted have no direct relationship with y1 and y2.

You must either 1) do a conversion to find the corresponding x, y coordinates of the centroids, or 2) reduce the dimensionality of the data to generate a 2D map that carries information from all the variables.

I am not very sure about the first option, because I have never done it (remapping the data). For dimensionality reduction, I recommend t-SNE (https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) or classic PCA.
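As a rough sketch of the PCA option, the data and the centroids can be projected with the same fitted transform so that both live in the same 2D space. The random matrix below is only a stand-in for the question's 14-feature X (an assumption so the snippet runs on its own), and the names X_2d and centers_2d are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the 14-feature matrix X from the question (assumption).
rng = np.random.default_rng(53)
X = rng.normal(size=(200, 14))

# Cluster in the full 14-dimensional space, as in the question.
model = KMeans(n_clusters=4, random_state=53, n_init=10)
labels = model.fit_predict(X)

# Fit PCA on the data, then project the data AND the centroids
# with the same transform so they share one 2D coordinate system.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
centers_2d = pca.transform(model.cluster_centers_)
```

X_2d[:, 0] and X_2d[:, 1] can then be passed to sns.scatterplot with hue=labels, and centers_2d[:, 0] and centers_2d[:, 1] to the centroid scatter. Keep in mind that the two axes are principal components, not original features, and clusters that are separated in 14 dimensions may still overlap in the projection.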

Moral: if you want to see a 2D cluster, make sure you only have 2 variables.

Upvotes: 1

ChrisOram

Reputation: 1434

The answer to your main question: the cluster centers are not outside of your clusters.

1: You are clustering over the 14 features in the features_clustering list.

2: You are viewing the clusters in a two-dimensional space, arbitrarily choosing amenities_count and corrected_price for the data, and two coordinates for the cluster centers, x=model.cluster_centers_[:,0] and y=model.cluster_centers_[:,1], which don't correspond to the same features.

For these reasons you are going to get strange results; they really don't mean anything.

The bottom line is that you cannot view a 14-dimensional clustering in two dimensions.

To show point 2 more clearly, change the line that plots the cluster centers to

sns.scatterplot(x=model.cluster_centers_[:,10], y=model.cluster_centers_[:,13], color='blue',marker='*', label='centroid', s=250)

so that the cluster centers are plotted against the same features as the data.
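To avoid hard-coding 10 and 13, the column positions can be looked up by name from the feature list. This is only a sketch: the random matrix stands in for the real data (an assumption so the snippet runs on its own), and ix, iy, centers_x and centers_y are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans

features_clustering = ['review_scores_accuracy', 'distance_to_center', 'bedrooms',
                       'review_scores_location', 'review_scores_value',
                       'number_of_reviews', 'beds', 'review_scores_communication',
                       'accommodates', 'review_scores_checkin', 'amenities_count',
                       'review_scores_rating', 'reviews_per_month', 'corrected_price']

# Synthetic stand-in for df[features_clustering] (assumption).
rng = np.random.default_rng(53)
X = rng.normal(size=(200, len(features_clustering)))
model = KMeans(n_clusters=4, random_state=53, n_init=10).fit(X)

# Columns of cluster_centers_ follow the column order of the training data,
# so the position of a feature in the list is also its centroid column.
ix = features_clustering.index('amenities_count')
iy = features_clustering.index('corrected_price')
centers_x = model.cluster_centers_[:, ix]
centers_y = model.cluster_centers_[:, iy]
```

centers_x and centers_y can then be used as the x and y arguments of the centroid scatterplot. The centroids will line up with the plotted features, but the picture is still only a 2D slice of a 14-dimensional clustering.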


The linked SO answer about cluster centers lying outside the cluster data deals with scaling the data to between 0 and 1 before clustering and then not scaling the cluster centers back up when plotting them with the real data. This is not the same as your issue here.
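For completeness, the scaling situation from that other question might look roughly like the sketch below. It assumes MinMaxScaler was used for the 0 to 1 scaling, and the data is synthetic; neither comes from your code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic data on a large scale (assumption, for illustration only).
rng = np.random.default_rng(53)
X = rng.uniform(low=0, high=500, size=(200, 3))

# Scale every feature to [0, 1] and cluster the scaled data.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
model = KMeans(n_clusters=4, random_state=53, n_init=10).fit(X_scaled)

# The centroids live in the scaled space ...
centers_scaled = model.cluster_centers_
# ... so map them back before plotting them on top of the unscaled data.
centers_original = scaler.inverse_transform(centers_scaled)
```

Plotting centers_scaled on top of the raw X would put every centroid near the origin, which is the effect described in the linked question. Plotting centers_original instead keeps the data and the centroids in the same units.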

Upvotes: 2
