Reputation: 95
I am using the MiniBatchKMeans
model from the sklearn.cluster
module in anaconda. I am clustering a data-set that contains approximately 75,000 points. It looks something like this:
data = np.array([8,3,1,17,5,21,1,7,1,26,323,16,2334,4,2,67,30,2936,2,16,12,28,1,4,190...])
I fit the data using the process below.
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(batch_size=100)
kmeans.fit(data.reshape(-1,1)
This is all well and okay, and I proceed to find the centroids of the data:
centroids = kmeans.cluster_centers_
print centroids
Which gives me the following output:
array([[ 13.09716569],
[ 2908.30379747],
[ 46.05089228],
[ 725.83453237],
[ 95.39868475],
[ 1508.38356164],
[ 175.48099948],
[ 350.76287263]])
But, when I run the process again, using the same data, I get different values for the centroids, such as this:
array([[ 29.63143489],
[ 1766.7244898 ],
[ 171.04417206],
[ 2873.70454545],
[ 70.05295277],
[ 1074.50387597],
[ 501.36134454],
[ 8.30600975]])
Can anyone explain why this is?
Upvotes: 1
Views: 1472
Reputation: 77454
Read up on what mini-batch k-means is.
It will never even converge. Do one more iteration, the result will change again.
It is design for data sets so huge you cannot load them into memory at once. So you load a batch, pretend this were the full data set, do one iterarion. Repeat woth the next batch. If your batches are large enough and random, then the result will be "close enough" to be usable. While itjs never optimal.
Thus:
P.S. on one-dimensional data, sort your data set and then use an algorithm that benefits from the sorting instead of k-means.
Upvotes: 2
Reputation: 564
The behavior you are experiencing probably has to do with the under the hood implementation of k-means clustering that you are using. k-means clustering is an NP-hard problem, so all the implementations out there are heuristic methods. What this means practically is that for a given seed, it will converge toward a local optima that isn't necessarily consistent across multiple seeds.
Upvotes: 1