High schooler

Reputation: 1680

Handling k-means with a large 6 GB dataset in scikit-learn?

I am using scikit-learn. I want to cluster a 6 GB dataset of documents and find clusters of documents.

I only have about 4 GB of RAM, though. Is there a way to get k-means to handle large datasets in scikit-learn?

Thank you. Please let me know if you have any questions.

Upvotes: 1

Views: 2744

Answers (2)

Fred Foo

Reputation: 363818

Use MiniBatchKMeans together with HashingVectorizer; that way, you can learn a cluster model in a single pass over the data, assigning cluster labels as you go or in a second pass. There's an example script that demonstrates MBKM.
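A minimal sketch of this approach, assuming the documents arrive as an iterable of strings (the toy corpus and batch size here are placeholders for your real 6 GB stream). HashingVectorizer is stateless, so each batch can be vectorized independently, and MiniBatchKMeans.partial_fit updates the centroids incrementally:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

# toy corpus standing in for a document stream too big to hold in RAM
docs = [
    "machine learning with python",
    "clustering large text corpora",
    "python scikit-learn tutorial",
    "deep learning for text",
    "cats and dogs are pets",
    "pets need food and care",
]

def batches(items, size):
    """Yield fixed-size chunks; in practice this would read from disk."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# HashingVectorizer keeps no vocabulary in memory
vectorizer = HashingVectorizer(n_features=2**18)
km = MiniBatchKMeans(n_clusters=2, random_state=0)

# first pass: learn the cluster model incrementally
for batch in batches(docs, 2):
    km.partial_fit(vectorizer.transform(batch))

# second pass: assign cluster labels
labels = []
for batch in batches(docs, 2):
    labels.extend(km.predict(vectorizer.transform(batch)))
print(labels)
```

Because the vectorizer never builds a vocabulary and each batch is transformed on the fly, peak memory depends only on the batch size, not the corpus size.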

Upvotes: 7

John Greenall

Reputation: 1690

Clustering is not in itself a well-defined problem (a 'good' clustering result depends on your application), and the k-means algorithm only gives locally optimal solutions that depend on random initialization. I therefore doubt that the results you would get from clustering a random 2 GB subsample of the dataset would be qualitatively different from the results of clustering the entire 6 GB. I would certainly try clustering on a reduced dataset as a first port of call. The next options are to subsample more intelligently, or to do multiple training runs with different subsets and perform some kind of selection/averaging across the runs.
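A sketch of the subsampling idea, using synthetic blob data as a stand-in for a dataset too large to cluster whole (the sizes and cluster count are illustrative assumptions): fit k-means on a random subsample, then assign every point using the learned centroids.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in for a dataset that does not fit in memory
X, _ = make_blobs(n_samples=3000, centers=3, random_state=0)

# fit on a random one-third subsample only
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=len(X) // 3, replace=False)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[idx])

# assign every point with centroids learned from the subsample
labels = km.predict(X)
print(len(labels), len(set(labels)))
```

In a real out-of-core setting you would read only the sampled rows from disk; repeating this with different subsamples and comparing (or averaging) the resulting centroids covers the multiple-runs suggestion above.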

Upvotes: 1
