Reputation: 479
I asked a similar question here: How to apply KMeans to get the centroid using dataframe with multiple features, and I received some valuable responses. However, I have not succeeded in getting KMeans clustering to work on a dataframe with more than 4 columns.
The dataframe in question has 5 columns as below:
col1,col2,col3,col4,col5
0.54,0.68,0.46,0.98,0.15
0.52,0.44,0.19,0.29,0.44
1.27,1.15,1.32,0.60,0.14
0.88,0.79,0.63,0.58,0.18
1.39,1.15,1.32,0.41,0.44
0.86,0.80,0.65,0.65,0.11
1.68,1.99,3.97,0.16,0.55
0.78,0.63,0.40,0.36,0.10
2.95,2.66,7.11,0.18,0.15
1.44,1.33,1.79,0.24,0.22
I have a simple KMeans clustering Python script that I try to apply to the 5-column dataframe, as below.
from numpy import unique
from numpy import where
from sklearn.cluster import KMeans
from matplotlib import pyplot
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')
X = np.array(df)
model = KMeans(n_clusters=5)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], X[row_ix, 2], X[row_ix, 3], X[row_ix, 4])
pyplot.show()
When I run the code, it complains about the line pyplot.scatter(X[row_ix, 0], X[row_ix, 1], X[row_ix, 2], X[row_ix, 3], X[row_ix, 4]) with the error message 'ValueError: Unrecognized marker style [[0.14 0.44 0.22]]'. However, if I remove the 5th column from the dataframe (i.e. col5) and remove X[row_ix, 4] from the code, the clustering works.
What do I need to do to get KMeans working on my example dataframe?
[Updated: 2 or 3 dimensions at a time]
In the previous post it was suggested that I could split the task by plotting 2 or 3 dimensions at a time using the function below. However, the function does not produce the expected clustering output (see attached output.png).
def plot(self):
    import itertools
    # generate all pairwise combinations of features
    # (materialised as a list so len() works below)
    combinations = list(itertools.combinations(range(self.K), 2))
    # initialise one subplot for each feature combination
    fig, axes = plt.subplots(figsize=(12, 8), nrows=len(combinations), ncols=1)
    for (x, y), ax in zip(combinations, axes.ravel()):  # loop through combinations and subplots
        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py)
        for point in self.centroids:
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py, marker="x", color='black', linewidth=2)
        ax.set_title('feature {} vs feature {}'.format(x, y))
    plt.show()
How can I fix the above function to get the expected clustering output?
Upvotes: 2
Views: 1348
Reputation: 46898
As mentioned in the other answer and the comments, you cannot plot all 5 axes together. One option is to use dimensionality reduction, such as PCA, to project the data down to 2 dimensions and plot those:
import numpy as np
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv('test.csv')
model = KMeans(n_clusters=5)
model.fit(df)
yhat = model.predict(df)
clusters = np.unique(yhat)

# project the 5-dimensional data onto the first 2 principal components
dims = PCA(n_components=2).fit_transform(df)
dims = pd.DataFrame(dims, columns=['PC1', 'PC2'])

fig, ax = plt.subplots(1, 1)
for cluster in clusters:
    ix = yhat == cluster
    ax.scatter(x=dims.loc[ix, 'PC1'], y=dims.loc[ix, 'PC2'], label=cluster)
ax.legend()
plt.show()
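Note that the PCA projection is used only for plotting: KMeans is still fitted on all 5 original features, and the 2-D scatter is just a view of those cluster assignments.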
Or you can use seaborn and visualize all of your variables, which is feasible when you only have 5:
import seaborn as sns

df['cluster'] = yhat
sns.pairplot(data=df, hue='cluster', diag_kind=None)
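pairplot draws a scatter plot for every pair of the 5 features, coloured by the cluster label, so you can inspect the cluster structure across all variable pairs at once.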
Upvotes: 2
Reputation: 326
Your KMeans works, but the way you are displaying the result is not valid. If you look at the documentation of matplotlib's scatter function (https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.scatter.html), you will see that the first four positional arguments (x, y, s, c) accept array-likes, while the fifth only accepts a marker style. That is why you only get an error when you add the fifth argument. In effect, you are trying to plot a 5-dimensional dataset on a 2-dimensional plane, which is not possible without doing dimensionality reduction beforehand. PCA or PLS-DA could be good options to reduce the dimensionality of your dataset.
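To make the argument mapping concrete, here is a minimal sketch (with random stand-in data, since it only illustrates the call signature) of how scatter consumes positional arguments, and how plotting two chosen features with keyword arguments avoids the error:

import numpy as np
from matplotlib import pyplot

X = np.random.rand(10, 5)  # stand-in for the 5-column data

# pyplot.scatter's positional parameters are (x, y, s, c, marker):
# in the failing call, column 3 is silently consumed as point sizes,
# column 4 as colours, and column 5 lands in the marker slot,
# producing the 'Unrecognized marker style' ValueError.
pyplot.scatter(X[:, 0], X[:, 1],
               s=100 * X[:, 2],  # optionally encode a 3rd feature as size
               c=X[:, 3])        # and a 4th feature as colour
pyplot.show()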
Upvotes: 1