CloseToC

Reputation: 165

sklearn's KMeans: Cluster centers and cluster means differ. Numerical Imprecision?

I've noticed that when using sklearn.cluster.KMeans to obtain clusters, the cluster centers from the .cluster_centers_ attribute and the means computed manually for each cluster don't give exactly the same answer.

For small sample sizes the difference is very small and probably within float imprecision. But for larger samples:

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)
x = np.random.normal(size=5000)
# Precedence note: this computes x - (mean / std); a z-score would be (x - x.mean()) / x.std()
x_z = (x - x.mean() / x.std()).reshape(5000,1)
cluster=KMeans(n_clusters=2).fit(x_z)

df = pd.DataFrame(x_z)
df['label'] =  cluster.labels_

difference = np.abs(df.groupby('label').mean() - cluster.cluster_centers_)
print(difference)

[[ 0.00217333] [ 0.00223798]]

Doing the same thing for different sample sizes: [image: plot of the difference for different sample sizes]

This seems too large to be floating-point imprecision. Are cluster centers not means, or is something else going on here?

Upvotes: 2

Views: 743

Answers (1)

Mabel Villalba

Reputation: 2598

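I think it is related to the tol parameter of KMeans, which sets the convergence tolerance: iteration stops once the cluster centers move less than this threshold between two consecutive iterations, so the returned centers need not be the exact means of the final labels. The default value is 1e-4; setting a lower value, e.g. tol=1e-8, gives: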

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)
x = np.random.normal(size=5000)
x_z = (x - x.mean() / x.std()).reshape(5000,1)
cluster=KMeans(n_clusters=2, tol=1e-8).fit(x_z)

df = pd.DataFrame(x_z)
df['label'] =  cluster.labels_

difference = np.abs(df.groupby('label').mean() - cluster.cluster_centers_)
print(difference)

                    0
label                
0      9.99200722e-16
1      1.11022302e-16
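
To compare the two settings side by side, here is a minimal sketch that measures the largest gap between cluster_centers_ and the manually computed per-label means for each tol. The helper name center_gap is made up for illustration, and random_state=0 and n_init=10 are my own choices to make the run repeatable, not something from the question. The data here is standardized as (x - mean) / std, so the numbers will not match the outputs above exactly.

```python
import numpy as np
from sklearn.cluster import KMeans

# Reproducible 1-D sample, standardized and reshaped to (n_samples, 1)
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
X = ((x - x.mean()) / x.std()).reshape(-1, 1)


def center_gap(tol):
    """Largest |cluster_centers_ - manual per-label mean| for a given tol.

    Hypothetical helper for illustration only.
    """
    km = KMeans(n_clusters=2, tol=tol, n_init=10, random_state=0).fit(X)
    # Mean of the points assigned to each label, in label order
    manual = np.array([X[km.labels_ == k].mean(axis=0) for k in range(2)])
    return float(np.abs(manual - km.cluster_centers_).max())


# Default tolerance versus the tighter one suggested above
gaps = {tol: center_gap(tol) for tol in (1e-4, 1e-8)}
for tol, gap in gaps.items():
    print(f"tol={tol:g}: max gap = {gap:.3e}")
```

If the tolerance is the cause, the gap for tol=1e-8 should be down at floating-point level, while the gap for the default may be noticeably larger depending on the data and the scikit-learn version.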

Hope it helps.

Upvotes: 3
