Reputation: 807
I want to get the k-means cost (inertia
in scikit kmeans).
Just to remind:
The cost is the sum of squared distanctes from each point to the nearest cluster.
I get a strange difference between the cost calc of scikit('inertia'),
and my own trivial way for computing the cost
Please see the following example:
p = np.random.rand(1000000,2)
from sklearn.cluster import KMeans
a = KMeans(n_clusters=3).fit(p)
print a.inertia_ , "****"
means = a.cluster_centers_
s = 0
for x in p:
best = float("inf")
for y in means:
if np.linalg.norm(x-y)**2 < best:
best = np.linalg.norm(x-y)**2
s += best
print s, "*****"
Where for my run the output is:
66178.4232156 ****
66173.7928716 *****
Where on my own dataset, the result is more significant(20% difference).
Is this a bug in scikit's implementation?
Upvotes: 1
Views: 2158
Reputation: 66815
First - it does not seem to be a bug (but for sure ugly inconsistency). Why is that? You need to take a closer look into what is the code actually doing. For this general purpose it calls the cython code from _k_means.pyx
(lines 577-578)
inertia = _k_means._assign_labels_array(
X, x_squared_norms, centers, labels, distances=distances)
and what it does is essentialy exactly your code, but... using doubles in C. So maybe it is just a numerical issue? Let us test your code but now, with clear clusters structure (thus there are no points which might be assigned to many centers - depending on numerical accuracy).
import numpy as np
from sklearn.metrics import euclidean_distances
p = np.random.rand(1000000,2)
p[:p.shape[0]/2, :] += 100 #I move half of points far away
from sklearn.cluster import KMeans
a = KMeans(n_clusters=2).fit(p) #changed to two clusters
print a.inertia_ , "****"
means = a.cluster_centers_
s = 0
for x in p:
best = float("inf")
for y in means:
d = (x-y).T.dot(x-y)
if d < best:
best = d
s += best
print s, "*****"
results
166805.190832 ****
166805.190946 *****
makes sense. Thus the problem is with existance of samples "near the boundary" which might be assigned to multiple clusters depending on arithmetic accuracy. Unftunately I was not able to trace exactly where the difference comes from.
The funny thing is that there is actually an inconsistency coming from the fact, that inertia_
field is filled with Cython code, and .score
calls NumPy one. Thus if you call
print -a.score(p)
you will get exactly your inertia.
Upvotes: 2