Reputation: 1524
I have a question about scipy's kmeans and kmeans2. I have a set of 1700 lat-long data points. I want to spatially cluster them into 100 clusters. However, I get drastically different results when using kmeans vs kmeans2. Can you explain why this is? My code is below.
First I load my data and plot the coordinates. It all looks correct.
import pandas as pd, numpy as np, matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, kmeans2, whiten
df = pd.read_csv('data.csv')
df.head()
coordinates = df.as_matrix(columns=['lon', 'lat'])
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(coordinates[:,0], coordinates[:,1], c='c', s=100)
plt.show()
Next, I whiten the data and run kmeans() and kmeans2(). When I plot the centroids from kmeans(), it looks about right - i.e., approximately 100 points that more or less represent the locations of the full 1700-point data set.
N = len(coordinates)
w = whiten(coordinates)
k = 100
i = 20
cluster_centroids1, distortion = kmeans(w, k, iter=i)
cluster_centroids2, closest_centroids = kmeans2(w, k, iter=i)
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(cluster_centroids1[:,0], cluster_centroids1[:,1], c='r', s=100)
plt.show()
However, when I next plot the centroids from kmeans2(), it looks totally wonky to me. I would expect the results from kmeans and kmeans2 to be fairly similar, but they are completely different. While the result from kmeans does appear to simplify yet still represent my full data set, the result from kmeans2 looks nearly random.
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(cluster_centroids2[:,0], cluster_centroids2[:,1], c='r', s=100)
plt.show()
Here are my values for k and N, along with the sizes of the arrays resulting from kmeans() and kmeans2():
print 'k =', k
print 'N =', N
print len(cluster_centroids1)
print len(cluster_centroids2)
print len(closest_centroids)
print len(np.unique(closest_centroids))
Output:
k = 100
N = 1759
96
100
1759
17
Why would len(cluster_centroids1) not be equal to k? len(closest_centroids) is equal to N, which seems correct. But why would len(np.unique(closest_centroids)) not be equal to k? len(cluster_centroids2) is equal to k, but again, when plotted, cluster_centroids2 doesn't seem to represent the original data set the way cluster_centroids1 does.
Lastly, I plot my full coordinate data set, colored by cluster.
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(coordinates[:,0], coordinates[:,1], c=closest_centroids, s=100)
plt.show()
You can see it here:
Upvotes: 4
Views: 6619
Reputation: 23540
Thank you for the good question with sample code and images! This is a great newbie question.
Most of the peculiarities can be solved by careful reading of the docs. A few things:
When comparing the original set of points and the resulting cluster centers, you should plot them in the same figure and in the same coordinates (i.e., w against the results). For example, plot the cluster centers with large dots as you've done, and the original data with small dots on top of them.
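A minimal sketch of such an overlay, assuming the w and cluster_centroids1 variables from the question are in scope:
# kmeans() centroids as large red dots, whitened data as small cyan dots on top.
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(cluster_centroids1[:,0], cluster_centroids1[:,1], c='r', s=100, label='kmeans centroids')
plt.scatter(w[:,0], w[:,1], c='c', s=10, label='whitened data')
plt.legend()
plt.show()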
kmeans and kmeans2 start from different situations. kmeans2 starts from a random distribution of points, and since your data is not evenly distributed, kmeans2 converges to a non-ideal result. You might try adding the keyword minit='points' and see if the results change.
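As a sketch, reusing the variables from the question (minit='points' seeds the algorithm with k randomly chosen rows of w instead of draws from a Gaussian fitted to the data):
# Re-run kmeans2 with the initial centroids taken from actual observations.
cluster_centroids2, closest_centroids = kmeans2(w, k, iter=i, minit='points')
print(len(np.unique(closest_centroids)))  # ideally much closer to k than 17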
As the initial centroid choice is a bad one, only 17 of the initial 100 centroids actually have any points belonging to them (this is closely related to the random look of the graph).
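You can check this directly from the labels kmeans2 returned; a small sketch using the closest_centroids array from the question:
# Count how many of the k centroids actually received at least one point.
labels, counts = np.unique(closest_centroids, return_counts=True)
print(len(labels))  # number of non-empty clusters (17 here)
print(counts)       # cluster sizes; a few very large clusters point to bad seeding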
It seems that some centroids in kmeans may collapse into each other if that gives the smallest distortion. (This does not seem to be documented.) Thus you get only 96 centroids.
Upvotes: 1