GKC
GKC

Reputation: 479

KMeans clustering won't work on a dataframe with more than 4 columns

I have asked a similar question here: How to apply KMeans to get the centroid using dataframe with multiple features and I received some valuable responses. However, I have not succeeded in getting KMeans clustering working on a dataframe with more than 4 columns.

The dataframe in question has 5 columns as below:

col1,col2,col3,col4,col5
0.54,0.68,0.46,0.98,0.15
0.52,0.44,0.19,0.29,0.44
1.27,1.15,1.32,0.60,0.14
0.88,0.79,0.63,0.58,0.18
1.39,1.15,1.32,0.41,0.44
0.86,0.80,0.65,0.65,0.11
1.68,1.99,3.97,0.16,0.55
0.78,0.63,0.40,0.36,0.10
2.95,2.66,7.11,0.18,0.15
1.44,1.33,1.79,0.24,0.22

I have a simple KMeans clustering python code that I try to apply on the 5 column dataframe like below.

from numpy import unique
from numpy import where
from sklearn.cluster import KMeans
from matplotlib import pyplot
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

X = np.array(df)

model = KMeans(n_clusters=5)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], X[row_ix, 2], X[row_ix, 3], X[row_ix, 4])
pyplot.show()

When I run the code it complains about the line pyplot.scatter(X[row_ix, 0], X[row_ix, 1], X[row_ix, 2], X[row_ix, 3], X[row_ix, 4]), with the error message 'ValueError: Unrecognized marker style [[0.14 0.44 0.22]]'. However, if I remove the 5th column from the dataframe (i.e. col5) and remove X[row_ix, 4] from the code, the clustering works.

What do I need to do to get KMeans working on my example dataframe?

[Updated: 2 or 3 dimension at a time]

From the previous post, it was suggested I could split the task by representing 2 or 3 dimensions at a time using the below function. However, the function does not produce the expected clustering output (see attached output.png) enter image description here

def plot(self):
    

    import itertools
    combinations = itertools.combinations(range(self.K), 2) # generate all combinations of features
    
    fig, axes = plt.subplots(figsize=(12, 8), nrows=len(combinations), ncols=1) # initialise one subplot for each feature combination

    for (x,y), ax in zip(combinations, axes.ravel()): # loop through combinations and subpltos
        
        
        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py)

        for point in self.centroids:
            
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            
            ax.scatter(px, py, marker="x", color='black', linewidth=2)

        ax.set_title('feature {} vs feature {}'.format(x,y))
    plt.show()

How can I fix the above function to get the clustering output.

Upvotes: 2

Views: 1348

Answers (2)

StupidWolf
StupidWolf

Reputation: 46898

As mentioned in the other answer and comments, you cannot plot all 5 axis together. One way is use dimension reduction such as PCA to reduce it to 2 dimensions and plot:

import numpy as np
from sklearn.cluster import KMeans
from matplotlib import pyplot
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv('test.csv')

model = KMeans(n_clusters=5)
model.fit(df)
yhat = model.predict(df)
clusters = np.unique(yhat)

dims = PCA(n_components=2).fit_transform(X)
dims = pd.DataFrame(dims,columns=['PC1','PC2'])

fig,ax = plt.subplots(1,1) 
for cluster in clusters:
    ix = yhat == cluster
    ax.scatter(x=dims.loc[ix,'PC1'],y=dims.loc[ix,'PC2'],label=cluster)
ax.legend()

enter image description here

Or you do use seaborn and visualize all your variables, which is ok if you only have 5 variables:

import seaborn as sns
df['cluster'] = yhat
sns.pairplot(data=df,hue='cluster',diag_kind=None)

enter image description here

Upvotes: 2

jlesueur
jlesueur

Reputation: 326

Your KMeans work but the way you want to display the result is not proper. If you look at the documentation of matplotlib scatter function (https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.scatter.html), you will see that the four first arguments of the function can accept an array-like while the fifth only accept a 'MarkerStyle'. That's why you get an error only when when you add the fifth argument. Actually, you are trying to plot a 5 dimension dataset in a 2 dimension plane what is not possible without doing a dimensionality reduction beforehand. A PCA or a PLSDA could be a good option to reduce the dimensionality of your dataset.

Upvotes: 1

Related Questions