Sierra Walker

Reputation: 106

K-means plotting problem: not sure where I am going wrong, any suggestions?

I have been working with some data for quite some time now and I am trying to get 4 clusters using the k-means method. I will put my code below so you can see where I currently am. I'm not sure if I missed a step or what, but this is my first time doing k-means clustering with Python.

data=Bel_sort_cleaned.drop(['cprptp','JURISDICTION','STREET','ADDRESS','ddes1',                      
'DESCRIPTION','COM_BLDG_VALUE','OCCUPANCY','segments','OBJECTID'], axis=1)
data=data.dropna()

X=data.values.reshape(-1,1)
y=data['HOUSENUM'].values.reshape(-1,1)

kmeans = KMeans(n_clusters=4, random_state=0).fit(X)

label = kmeans.fit_predict(X)

plt.scatter(X[label==0, 0], X[label==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(X[label==1, 0], X[label==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(X[label==2, 0], X[label==2, 1], s=100, c='green', label ='Cluster 3')
plt.scatter(X[label==3, 0], X[label==3, 1], s=100, c='cyan', label ='Cluster 4')

The original data was loaded in earlier in my file, which is where 'Bel_sort_cleaned' comes from.

Any ideas would be greatly appreciated, as I am pretty stuck.

I am currently getting an IndexError.

Upvotes: 2

Views: 1949

Answers (1)

dipetkov

Reputation: 3690

The issue is with how you preprocess the input features X before clustering, but Python only stumbles on the problem when you attempt to plot the clusters.

So split the analysis into two parts: 1) clustering, 2) visualization. And make sure that part 1) works as intended before moving on to part 2).

Let's make this answer reproducible by providing the code to generate fake data for classification.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100,
    n_features=20, n_informative=4,
    n_classes=4, n_clusters_per_class=1,
    random_state=1234,
)
data = pd.DataFrame(np.c_[X, y])
X.shape, y.shape, data.shape
#> ((100, 21), (100,), (100, 21))

Why do you flatten X? This doesn't make sense, and it's where the bug occurs.

X = data.values.reshape(-1, 1)
X.shape
#> (2100, 1)
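To see why the flattening leads to an IndexError: after the reshape, X has a single column, so any attempt to index column 1 (as the plotting code does with `X[label == 0, 1]`) is out of bounds. A minimal sketch with made-up data:

```python
import numpy as np

# A (10, 1) array, shaped like the flattened X
X_flat = np.arange(10, dtype=float).reshape(-1, 1)
label = np.zeros(10, dtype=int)  # pretend every point landed in cluster 0

X_flat[label == 0, 0]  # column 0 exists, so this works
try:
    X_flat[label == 0, 1]  # column 1 does not exist
except IndexError as e:
    print(e)
#> index 1 is out of bounds for axis 1 with size 1
```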

Instead, let's use all the columns in the data frame. Or even better, for your actual analysis, explicitly specify the columns to use as input features for clustering. For example, do you really want to use HOUSENUM and OBJECTID to generate the K-Means clusters?

X = data.values
X.shape
#> (100, 21)
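For your real dataset, an explicit column list makes it clear which variables drive the clustering and keeps identifiers out. A sketch, with hypothetical column names standing in for the ones in Bel_sort_cleaned:

```python
import pandas as pd

# Hypothetical numeric columns; substitute the ones from your data
# that you actually want to cluster on.
data = pd.DataFrame({
    "LAND_VALUE": [100.0, 200.0, 150.0, 400.0],
    "BLDG_VALUE": [300.0, 250.0, 500.0, 100.0],
    "HOUSENUM": [1, 2, 3, 4],  # an identifier, not a meaningful feature
})

feature_cols = ["LAND_VALUE", "BLDG_VALUE"]
X = data[feature_cols].values
X.shape
#> (4, 2)
```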

kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(X)

Check that there are 4 centroids, one for each cluster. Each centroid has as many dimensions as there are input features in X.

kmeans.cluster_centers_.shape
#> (4, 21)

We've checked that there are 4 clusters. Finally, we are ready to plot them, along dimensions 1 and 2, which correspond to the first two features in the feature matrix X.

label = kmeans.predict(X)

plt.scatter(X[label == 0, 0], X[label == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X[label == 1, 0], X[label == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X[label == 2, 0], X[label == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X[label == 3, 0], X[label == 3, 1], s=100, c='cyan', label='Cluster 4')
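As a finishing touch (not required to fix the error), the four scatter calls can be collapsed into a loop, and `plt.legend()` makes the cluster labels visible. A self-contained version of the plotting step might look like:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Same fake data as above
X, _ = make_classification(
    n_samples=100, n_features=20, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=1234,
)
label = KMeans(n_clusters=4, random_state=0).fit_predict(X)

# One scatter call per cluster, along the first two features
for k, color in enumerate(["red", "blue", "green", "cyan"]):
    plt.scatter(X[label == k, 0], X[label == k, 1],
                s=100, c=color, label=f"Cluster {k + 1}")

plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
```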

[Scatter plot of the four clusters along the first two features]

Upvotes: 1
