Reputation: 1680
I am using scikit-learn. I want to cluster a 6 GB dataset of documents and find clusters among them.
I only have about 4 GB of RAM, though. Is there a way to get k-means to handle large datasets in scikit-learn?
Thank you. Please let me know if you have any questions.
Upvotes: 1
Views: 2744
Reputation: 363818
Use MiniBatchKMeans together with HashingVectorizer; that way, you can learn a cluster model in a single pass over the data, assigning cluster labels as you go or in a second pass. There's an example script that demonstrates MiniBatchKMeans.
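A minimal sketch of that streaming approach (not the example script referred to above), assuming the corpus is stored one document per line in a plain-text file; the file name, chunk size, feature count, and cluster count are illustrative placeholders:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

# HashingVectorizer is stateless, so it never needs to see the whole corpus.
vectorizer = HashingVectorizer(n_features=2**18)
km = MiniBatchKMeans(n_clusters=20, batch_size=1000, random_state=0)

def iter_chunks(path, chunk_size=1000):
    """Yield lists of documents (one per line) without loading the whole file."""
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk

# First pass: update the cluster centres incrementally.
for docs in iter_chunks("documents.txt"):
    km.partial_fit(vectorizer.transform(docs))

# Second pass: assign a cluster label to every document.
labels = []
for docs in iter_chunks("documents.txt"):
    labels.extend(km.predict(vectorizer.transform(docs)))
```

Because each chunk is hashed into a fixed-size sparse matrix and discarded after `partial_fit`, peak memory stays roughly at the size of one chunk rather than the full 6 GB.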
Upvotes: 7
Reputation: 1690
Clustering is not in itself that well-defined a problem (a 'good' clustering result depends on your application), and the k-means algorithm only gives locally optimal solutions that depend on its random initialization. I therefore doubt that the results you would get from clustering a random 2 GB subsample of the dataset would be qualitatively different from the results you would get from clustering the entire 6 GB. I would certainly try clustering on the reduced dataset as a first port of call. The next options are to subsample more intelligently, or to do multiple training runs with different subsets and apply some kind of selection or averaging across the runs (see the sketch below).
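A rough sketch of the multiple-runs-on-subsamples idea, assuming the documents have already been vectorized into a (possibly sparse) matrix X whose row subsamples fit in memory; the function name, subsample size, and cluster count are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def best_of_subsample_runs(X, n_clusters=20, n_runs=5, sample_size=100_000, seed=0):
    """Fit k-means on several random subsamples and keep the model whose
    centres give the lowest inertia on a common evaluation subsample."""
    rng = np.random.RandomState(seed)
    eval_idx = rng.choice(X.shape[0], size=sample_size, replace=False)
    X_eval = X[eval_idx]

    best_model, best_inertia = None, np.inf
    for _ in range(n_runs):
        idx = rng.choice(X.shape[0], size=sample_size, replace=False)
        model = KMeans(n_clusters=n_clusters, n_init=1, random_state=rng).fit(X[idx])
        # score() returns the negative of the k-means objective, so negate it
        # to get the inertia of X_eval under this run's centres.
        inertia = -model.score(X_eval)
        if inertia < best_inertia:
            best_model, best_inertia = model, inertia
    return best_model
```

Inertia only compares runs with the same number of clusters, so this selects among restarts rather than choosing k itself.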
Upvotes: 1