tisch

Reputation: 1118

Large-scale clustering library, possibly with Python bindings

I've been trying to cluster a larger dataset consisting of 50,000 measurement vectors of dimension 7. I'm trying to generate about 30 to 300 clusters for further processing.

I've been trying the following clustering implementations with no luck:

Are there any other implementations that I might have missed?

Upvotes: 10

Views: 8470

Answers (5)

Jules Gagnon-Marchand

Reputation: 3800

The real answer for genuinely large-scale situations is to use something like FAISS, Facebook Research's library for efficient similarity search and clustering of dense vectors.

See https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization
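A minimal sketch of the k-means helper described on that wiki page (assuming a faiss-cpu install; the iteration count and number of centroids are illustrative only):

import numpy as np
import faiss

# FAISS expects float32 input
data = np.random.rand(50000, 7).astype('float32')

# train k-means with 300 centroids
kmeans = faiss.Kmeans(data.shape[1], 300, niter=20, verbose=True)
kmeans.train(data)

centroids = kmeans.centroids                   # (300, 7) array of cluster centres
_, assignments = kmeans.index.search(data, 1)  # nearest centroid per vector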

Upvotes: 1

luispedro

Reputation: 7024

My package milk handles this problem easily:

import milk
import numpy as np
data = np.random.rand(50000,7)
%timeit milk.kmeans(data, 300)
1 loops, best of 3: 14.3 s per loop

I wonder whether you meant to write 500,000 data points, because 50k points is not that much. If so, milk takes a while longer (~700 seconds), but still handles it well, as it does not allocate any memory other than your data and the centroids.
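For reference, a minimal sketch of getting the actual clustering out (assuming milk.kmeans returns the per-point assignments together with the centroids, as in milk's documentation):

import numpy as np
import milk

data = np.random.rand(50000, 7)
# one cluster id per row, plus the 300 centroids
cluster_ids, centroids = milk.kmeans(data, 300)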

Upvotes: 4

Fred Foo

Reputation: 363818

Since you're already trying scikit-learn: sklearn.cluster.KMeans should scale better than Ward and supports parallel fitting on multicore machines. MiniBatchKMeans is better still, but won't do random restarts for you.

>>> import numpy as np
>>> from sklearn.cluster import MiniBatchKMeans
>>> X = np.random.randn(50000, 7)
>>> %timeit MiniBatchKMeans(n_clusters=30).fit(X)
1 loops, best of 3: 114 ms per loop
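Once the fit is done, the labels and centres come back as the usual scikit-learn attributes; a minimal sketch:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.randn(50000, 7)
km = MiniBatchKMeans(n_clusters=30).fit(X)
labels = km.labels_             # cluster id for each of the 50,000 rows
centers = km.cluster_centers_   # array of shape (30, 7)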

Upvotes: 7

Has QUIT--Anony-Mousse

Reputation: 77505

50,000 instances in 7 dimensions isn't really big, and should not kill an implementation.

Although it doesn't have Python bindings, give ELKI a try. The benchmark set used on its homepage has 110,250 instances in 8 dimensions; apparently it runs k-means on that in 60 seconds and the much more advanced OPTICS in 350 seconds.

Avoid hierarchical clustering; it's really only for small data sets. The way it is commonly implemented, via matrix operations, is O(n^3), which is really bad for large data sets. So I'm not surprised these two timed out for you.

DBSCAN and OPTICS, when implemented with index support, are O(n log n); implemented naively, they are O(n^2). K-means is really fast, but often the results are not satisfactory (because it always splits in the middle). It should run in O(n * k * iter), and it usually converges in few iterations (iter << 100). But it only works with Euclidean distance, and it just doesn't work well with some kinds of data (high-dimensional, discrete, binary, clusters of different sizes, ...).
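ELKI itself has no Python bindings, but since the question already uses scikit-learn, here is a minimal DBSCAN sketch with it (scikit-learn's DBSCAN uses a tree index for the neighbour queries; the eps and min_samples values below are placeholders that would need tuning for real data):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(50000, 7)
# eps and min_samples are illustrative only
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_   # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)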

Upvotes: 13

Hugh Bothwell

Reputation: 56714

OpenCV has a k-means implementation, KMeans2.

Expected running time is on the order of O(n**4). For an order-of-magnitude approximation, see how long it takes to cluster 1,000 points, then multiply that by roughly seven million: going from 1,000 to 50,000 points is a factor of 50 in n, and 50**4 is 6.25 million, rounded up to seven million.
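For illustration, a rough sketch using the modern cv2 binding, which exposes cv2.kmeans rather than the legacy KMeans2 (the termination criteria and attempt count are placeholder values):

import numpy as np
import cv2

# cv2.kmeans expects float32 input
data = np.random.rand(50000, 7).astype(np.float32)
# stop after 10 iterations or when the centres move by less than 1.0
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
compactness, labels, centers = cv2.kmeans(
    data, 30, None, criteria, 3, cv2.KMEANS_RANDOM_CENTERS)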

Upvotes: 0
