Reputation: 1314
I have a very large dataset (500 million documents) and want to cluster the documents according to their content.
What would be the best way to approach this? I tried using k-means but it does not seem suitable because it needs all documents at once in order to do the calculations.
Are there any clustering algorithms suitable for larger datasets?
For reference: I am using Elasticsearch to store my data.
Upvotes: 2
Views: 1400
Reputation: 77495
There are k-means variants that process documents one by one,
MacQueen, J. B. (1967). Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability 1.
and k-means variants that repeatedly draw a random sample.
D. Sculley (2010). Web Scale K-Means clustering. Proceedings of the 19th international conference on World Wide Web
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable k-means++. Proceedings of the VLDB Endowment, 5(7), 622-633.
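To make the one-by-one idea concrete, here is a minimal sketch of a MacQueen-style sequential update in plain Python. It assumes each document has already been turned into a fixed-length numeric vector (that step is not shown), and the names `nearest` and `online_kmeans` are made up for this illustration. The first k points are used as initial centers, which is a simplification; it is not a tuned implementation of any of the papers above.

```python
def nearest(centers, x):
    """Index of the center closest to x (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((c - v) ** 2 for c, v in zip(centers[j], x)),
    )


def online_kmeans(stream, k):
    """Single pass over `stream`; only the k centers are kept in memory."""
    centers, counts = [], []
    for x in stream:
        if len(centers) < k:
            # Assumption: seed the centers with the first k points seen.
            centers.append([float(v) for v in x])
            counts.append(1)
            continue
        j = nearest(centers, x)
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-center step size, shrinks as the center absorbs points
        # Move the winning center toward x; this keeps it equal to the
        # running mean of all points assigned to it so far.
        centers[j] = [c + eta * (v - c) for c, v in zip(centers[j], x)]
    return centers, counts


# Toy input standing in for document vectors.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centers, counts = online_kmeans(points, k=2)
```

The sampling variants apply the same kind of per-center update, but to a small random batch at a time instead of a single point, so in both cases the full dataset never has to be in memory at once.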
But in the end, it's still useless old k-means. It's a good quantization approach, but it is not very robust to noise and cannot handle clusters of different sizes, non-convex shapes, or hierarchy (e.g. baseball inside sports). It's a signal processing technique, not a data organization technique.
So the practical impact of all these is 0. Yes, they can run k-means on insane data - but if you can't make sense of the result, why would you do so?
Upvotes: 0
Reputation: 9393
According to Prof. J. Han, who is currently teaching the Cluster Analysis in Data Mining class at Coursera, the most common methods for clustering text data are:
But I can't tell how to apply these on your dataset. It's big - good luck.
For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). He is the developer of the tm package (used in R) for text mining via document-term matrices.
The thesis contains case studies (Ch. 8.1.4 and 9) on applying k-means and then a Support Vector Machine classifier to some documents (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.
The process contains lots of intermediate steps of manual inspection.
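In case the term is unfamiliar: a document-term matrix has one row per document and one column per term, with each cell holding how often the term occurs in the document. A rough sketch in plain Python (not the tm package's implementation; the function name and the lowercase-and-split-on-whitespace tokenization are assumptions for illustration):

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocab, rows), where rows[i][j] counts term vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    column = {term: j for j, term in enumerate(vocab)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for term, n in Counter(tokens).items():
            row[column[term]] = n
        rows.append(row)
    return vocab, rows


vocab, rows = document_term_matrix(["the court ruled", "the list the mail"])
```

Each row is then a vector that a clustering algorithm can work on. For a collection of your size a dense matrix like this would not fit in memory, so a sparse representation would be needed.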
Upvotes: 1