Reputation: 1524
I have a question about scipy's kmeans and kmeans2. I have a set of 1700 lat-long data points. I want to spatially cluster them into 100 clusters. However, I get drastically different results when using kmeans vs kmeans2. Can you explain why this is? My code is below.
First I load my data and plot the coordinates. It all looks correct.
import pandas as pd, numpy as np, matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, kmeans2, whiten
df = pd.read_csv('data.csv')
df.head()
coordinates = df.as_matrix(columns=['lon', 'lat'])
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(coordinates[:,0], coordinates[:,1], c='c', s=100)
plt.show()
Next, I whiten the data and run kmeans() and kmeans2(). When I plot the centroids from kmeans(), it looks about right - i.e., approximately 100 points that more or less represent the locations of the full 1700-point data set.
N = len(coordinates)
w = whiten(coordinates)
k = 100
i = 20
cluster_centroids1, distortion = kmeans(w, k, iter=i)
cluster_centroids2, closest_centroids = kmeans2(w, k, iter=i)
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(cluster_centroids1[:,0], cluster_centroids1[:,1], c='r', s=100)
plt.show()
However, when I next plot the centroids from kmeans2(), it looks totally wonky to me. I would expect the results from kmeans and kmeans2 to be fairly similar, but they are completely different. While the result from kmeans does appear to simplify yet still represent my full data set, the result from kmeans2 looks nearly random.
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(cluster_centroids2[:,0], cluster_centroids2[:,1], c='r', s=100)
plt.show()
Here are my values for k and N, along with the sizes of the arrays resulting from kmeans() and kmeans2():
print 'k =', k
print 'N =', N
print len(cluster_centroids1)
print len(cluster_centroids2)
print len(closest_centroids)
print len(np.unique(closest_centroids))
Output:
k = 100
N = 1759
96
100
1759
17
Why would len(cluster_centroids1) not be equal to k? len(closest_centroids) is equal to N, which seems correct. But why would len(np.unique(closest_centroids)) not be equal to k? len(cluster_centroids2) is equal to k, but again, when plotted, cluster_centroids2 doesn't seem to represent the original data set the way cluster_centroids1 does.
Lastly, I plot my full coordinate data set, colored by cluster.
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(coordinates[:,0], coordinates[:,1], c=closest_centroids, s=100)
plt.show()
You can see it here:
Upvotes: 4
Views: 6619
Reputation: 23540
Thank you for the good question with sample code and images! This is a great newbie question.
Most of the peculiarities can be solved by careful reading of the docs. A few things:
When comparing the original set of points and the resulting cluster centers, you should plot them in the same figure and in the same coordinates (i.e., w against the results). For example, plot the cluster centers with large dots as you've done, and the original data with small dots on top of them.
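A minimal sketch of such an overlay, assuming the w and cluster_centroids1 variables from the question are in scope:
# kmeans() centroids as large red dots, whitened data as small cyan dots on top.
plt.figure(figsize=(10, 6), dpi=100)
plt.scatter(cluster_centroids1[:,0], cluster_centroids1[:,1], c='r', s=100, label='kmeans centroids')
plt.scatter(w[:,0], w[:,1], c='c', s=10, label='whitened data')
plt.legend()
plt.show()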
kmeans and kmeans2 start from different situations. kmeans2 starts from a random distribution of points, and since your data is not evenly distributed, kmeans2 converges to a non-ideal result. You might try adding the keyword minit='points' and see if the results change.
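As a sketch, reusing the variables from the question (minit='points' seeds the algorithm with k randomly chosen rows of w instead of draws from a Gaussian fitted to the data):
# Re-run kmeans2 with the initial centroids taken from actual observations.
cluster_centroids2, closest_centroids = kmeans2(w, k, iter=i, minit='points')
print(len(np.unique(closest_centroids)))  # ideally much closer to k than 17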
As the initial centroid choice is a bad one, only 17 of the initial 100 centroids actually have any points belonging to them (this is closely related to the random look of the graph).
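You can check this directly from the labels kmeans2 returned; a small sketch using the closest_centroids array from the question:
# Count how many of the k centroids actually received at least one point.
labels, counts = np.unique(closest_centroids, return_counts=True)
print(len(labels))  # number of non-empty clusters (17 here)
print(counts)       # cluster sizes; a few very large clusters point to bad seeding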
It seems that some centroids in kmeans may collapse into each other if that gives the smallest distortion. (This does not seem to be documented.) Thus you get only 96 centroids.
Upvotes: 1