Reputation: 165
I've noticed that when using sklearn.cluster.KMeans to obtain clusters, the cluster centers, from the method .cluster_centers_, and computing means manually for each cluster don't seem to give exactly the same answer.
For small sample sizes the difference is very small and probably within float imprecision. But for larger samples:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
np.random.seed(0)
x = np.random.normal(size=5000)
x_z = (x - x.mean() / x.std()).reshape(5000,1)
cluster=KMeans(n_clusters=2).fit(x_z)
df = pd.DataFrame(x_z)
df['label'] = cluster.labels_
difference = np.abs(df.groupby('label').mean() - cluster.cluster_centers_)
print(difference)
[[ 0.00217333] [ 0.00223798]]
Doing the same thing for different sample sizes:
This seems too much to be floating point imprecision. Are cluster centers not means, or what's going on here?
Upvotes: 2
Views: 743
Reputation: 2598
I think that it may be related to the tolerance of KMeans
. The default value is 1e-4
, so setting a lower value, i.e. tol=1e-8
gives:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
np.random.seed(0)
x = np.random.normal(size=5000)
x_z = (x - x.mean() / x.std()).reshape(5000,1)
cluster=KMeans(n_clusters=2, tol=1e-8).fit(x_z)
df = pd.DataFrame(x_z)
df['label'] = cluster.labels_
difference = np.abs(df.groupby('label').mean() - cluster.cluster_centers_)
print(difference)
0
label
0 9.99200722e-16
1 1.11022302e-16
Hope it helps.
Upvotes: 3