Reputation: 73
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
x = [916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
234,232,227,227,222,221,221,219,214,201,200,194,169,155,140]
kmeans = KMeans(n_clusters=4)
a = kmeans.fit(np.reshape(x,(len(x),1)))
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r.","y.","b."]
for i in range(len(x)):
    plt.plot(x[i], colors[labels[i]], markersize = 10)
plt.scatter(centroids[:, 0], marker = "x", s = 150, linewidths = 5, zorder = 10)
plt.show()
The code above displays 4 clusters, but they are definitely not something I want to have.
I also get an error, which makes it even worse. The output I get is in the picture below.
The error I get is: TypeError: scatter() missing 1 required positional argument: 'y'
The error is not a big deal, because I don't like the result I have anyway.
The image below shows how I want my clusters to look.
Upvotes: 5
Views: 11803
Reputation: 7507
Since you are working with one-dimensional data, you should understand what exactly you are computing. With KMeans on this data you extract four average values (the centroids); the best thing you can do here is to plot each value against its index and draw four horizontal lines showing these averages. I get the following picture with the code below. This picture is the 1D equivalent of the 2D picture you are showing.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
x = [916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
234,232,227,227,222,221,221,219,214,201,200,194,169,155,140]
kmeans = KMeans(n_clusters=4)
a = kmeans.fit(np.reshape(x,(len(x),1)))
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r.","y.","b."]
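# one horizontal line per centroid, spanning the whole index range
for i in centroids:
    plt.plot([0, len(x)-1], [i, i], "k")
# plot each value against its index, colored by cluster label
for i in range(len(x)):
    plt.plot(i, x[i], colors[labels[i]], markersize = 10)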
plt.show()
Computing k-means on 1D data is more interesting with curves like the following one (from the page http://lasp.colorado.edu/home/sorce/2013/01/28/the-sorce-mission-celebrates-ten-years/), because you can clearly see two distinct average values:
Upvotes: 0
Reputation: 77454
Don't expect a pretty 2d plot without making up data.
To get rid of the error, you can set y=x. But it will not change much: the data will still lie on a one-dimensional line.
You could of course add random noise, and set y to random values. But that means making up fake data.
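As a minimal sketch of the y=x suggestion above: scatter() needs both coordinates, so each value is passed as its own y. The `n_init` and `random_state` arguments and the color choices are my additions for reproducibility, not part of the question's code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

x = np.array([916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
              380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
              234,232,227,227,222,221,221,219,214,201,200,194,169,155,140])

# KMeans expects a 2D array of shape (n_samples, n_features)
# n_init and random_state are assumptions added for reproducible runs
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x.reshape(-1, 1))
centroids = kmeans.cluster_centers_[:, 0]
labels = kmeans.labels_

fig, ax = plt.subplots()
# scatter() requires x and y; using y = x puts every point on the diagonal
ax.scatter(x, x, c=labels, s=30)
ax.scatter(centroids, centroids, marker="x", c="k", s=150, linewidths=3, zorder=10)
```

Call plt.show() afterwards to display the figure. The points still fall on a single line, which is exactly the limitation described above.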
For one-dimensional data, I recommend not using clustering at all. These algorithms are designed for complex multivariate data where you cannot afford a good statistical model anymore. One-dimensional data can be sorted, which allows for much more efficient algorithms. You can easily do KDE on such data, and fit thousands of statistical distributions. This will give you a much more meaningful model with higher statistical power.
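To illustrate the KDE idea, here is a rough sketch using only NumPy. The hand-written Gaussian estimator, the bandwidth from Silverman's rule of thumb, and the idea of cutting the data at local minima of the density are all my assumptions for the example, not something prescribed above; other bandwidths will give different splits.

```python
import numpy as np

x = np.array([916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
              380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
              234,232,227,227,222,221,221,219,214,201,200,194,169,155,140], dtype=float)

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

# Bandwidth from Silverman's rule of thumb (an assumption, tune as needed)
n = len(x)
q75, q25 = np.percentile(x, [75, 25])
bandwidth = 0.9 * min(x.std(ddof=1), (q75 - q25) / 1.34) * n ** (-0.2)

# Evaluate the density on a grid that extends a little past the data
grid = np.linspace(x.min() - 3 * bandwidth, x.max() + 3 * bandwidth, 1000)
density = gaussian_kde_1d(x, grid, bandwidth)

# Interior local minima of the density are candidate boundaries between groups
is_min = (density[1:-1] < density[:-2]) & (density[1:-1] < density[2:])
cut_points = grid[1:-1][is_min]

# Assign every value to the interval between consecutive cut points
groups = np.digitize(x, cut_points)
print(cut_points)
print(groups)
```

If the density has only one strong mode plus a small bump at the far right, that supports reading the data as a single skewed distribution with an outlier rather than as several clusters.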
From a quick look at your plot, I'd say there are no clusters. Instead, your data looks to me like a skewed normal distribution with one clear outlier (to be expected at this data set size). Please try a more statistical approach.
Upvotes: 3
Reputation: 2253
Your data is one-dimensional (a line). If you want to visualize it in two dimensions like the picture in your post, you should use two-dimensional or multi-dimensional data, for example [[1,3], [2,3], [1,5]].
After k-means, the points are divided into k clusters, and you can use scatter to visualize the output. By the way, scatter takes x and y; it is a two-dimensional visualization.
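A small sketch of that two-dimensional case: the six points below are hypothetical toy data (the first three are the example values above, the rest are made up for illustration), and `n_init` and `random_state` are assumptions added for reproducibility.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical 2D toy data: each row is one (x, y) point
points = np.array([[1, 3], [2, 3], [1, 5], [8, 8], [9, 7], [8, 9]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

fig, ax = plt.subplots()
# scatter() takes the x column and the y column separately
ax.scatter(points[:, 0], points[:, 1], c=labels, s=40)
ax.scatter(centroids[:, 0], centroids[:, 1], marker="x", c="k", s=150, linewidths=3, zorder=10)
```

Call plt.show() to display the plot. With real 2D features in place of the toy points, this produces the kind of picture shown in the question.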
I suggest you take a look at Orange, a Python data mining tool. You can do k-means by drag and drop,
and visualize the output of k-means easily.
Good luck! Data mining is fun :-)
Upvotes: 3