member555

Reputation: 807

scikit kmeans not accurate cost \ inertia

I want to get the k-means cost (inertia in scikit-learn's KMeans). Just as a reminder:

The cost is the sum of squared distances from each point to its nearest cluster center.
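
In other words, for points x_1, ..., x_n and cluster centers \mu_1, ..., \mu_k:

    cost = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2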

I get a strange difference between the cost computed by scikit-learn ('inertia')
and my own trivial way of computing the cost.

Please see the following example:

import numpy as np
from sklearn.cluster import KMeans

p = np.random.rand(1000000, 2)
a = KMeans(n_clusters=3).fit(p)
print a.inertia_, "****"

# recompute the cost by hand: for every point, take the squared distance
# to its nearest cluster center and sum these up
means = a.cluster_centers_
s = 0
for x in p:
    best = float("inf")
    for y in means:
        d = np.linalg.norm(x - y) ** 2
        if d < best:
            best = d
    s += best
print s, "*****"

Where for my run the output is:

66178.4232156 ****
66173.7928716 *****
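
As a sanity check, the same quantity can also be computed without the explicit double loop. This vectorized sketch with scipy.spatial.distance.cdist is not part of my original code, but it should agree with the loop up to floating-point rounding:

from scipy.spatial.distance import cdist

# squared distance from every point to every centroid, then the row-wise
# minimum (nearest centroid) summed over all points
print cdist(p, means, 'sqeuclidean').min(axis=1).sum()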

On my own dataset the difference is even more significant (about 20%).
Is this a bug in scikit's implementation?

Upvotes: 1

Views: 2158

Answers (1)

lejlot

Reputation: 66815

First of all, it does not seem to be a bug (though it is an ugly inconsistency). Why is that? You need to take a closer look at what the code is actually doing. To compute the inertia it calls the Cython code from _k_means.pyx

(lines 577-578)

    inertia = _k_means._assign_labels_array(
        X, x_squared_norms, centers, labels, distances=distances)

and what it does is essentially exactly your code, but... using doubles in C. So maybe it is just a numerical issue? Let us test your code again, but now with a clear cluster structure (so that there are no points which might be assigned to different centers depending on numerical accuracy).

import numpy as np
from sklearn.cluster import KMeans

p = np.random.rand(1000000, 2)
p[:p.shape[0] // 2, :] += 100  # move half of the points far away

a = KMeans(n_clusters=2).fit(p)  # changed to two clusters
print a.inertia_, "****"

# the same manual cost computation as before, with squared distances
# computed via dot products
means = a.cluster_centers_
s = 0
for x in p:
    best = float("inf")
    for y in means:
        d = (x - y).T.dot(x - y)
        if d < best:
            best = d
    s += best
print s, "*****"

results

166805.190832 ****
166805.190946 *****

This makes sense. Thus the problem is the existence of samples "near the boundary", which might be assigned to multiple clusters depending on arithmetic accuracy. Unfortunately I was not able to trace exactly where the difference comes from.
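
If you want to see how many such boundary samples a dataset has, one rough sketch is to count the points whose two closest centers are almost equally far away (the 1e-10 tolerance here is arbitrary):

import numpy as np
from sklearn.metrics import euclidean_distances

# squared distances from every point to every center, sorted per row;
# a point is "near the boundary" if its two closest centers are
# numerically almost equidistant
d2 = euclidean_distances(p, means, squared=True)
d2.sort(axis=1)
print np.sum(d2[:, 1] - d2[:, 0] < 1e-10)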

The funny thing is that there is actually an inconsistency coming from the fact that the inertia_ field is filled by the Cython code, while .score calls the NumPy one. Thus if you call

print -a.score(p)

you will get exactly your inertia.
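
So, assuming a, p and the manual sum s from the code above are still in scope, you can put all three numbers side by side:

print a.inertia_   # accumulated by the Cython code
print -a.score(p)  # NumPy path, matches the manual computation
print s            # the manual double loop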

Upvotes: 2
