fwind

Reputation: 1314

How to cluster large datasets

I have a very large dataset (500 million documents) and want to cluster the documents according to their content.

What would be the best way to approach this? I tried using k-means but it does not seem suitable because it needs all documents at once in order to do the calculations.

Are there any clustering algorithms suitable for larger datasets?

For reference: I am using Elasticsearch to store my data.

Upvotes: 2

Views: 1400

Answers (2)

Has QUIT--Anony-Mousse

Reputation: 77495

There are k-means variants that process documents one by one,

MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1.

and k-means variants that repeatedly draw a random sample.

Sculley, D. (2010). Web-scale k-means clustering. Proceedings of the 19th International Conference on World Wide Web.

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable k-means++. Proceedings of the VLDB Endowment, 5(7), 622-633.
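To make the sampling idea concrete, here is a minimal sketch in the spirit of the mini-batch approach from Sculley (2010), not a faithful reproduction of the paper. It assumes documents have already been turned into fixed-length numeric vectors. `sample_batch` is a hypothetical callable that returns a random sample of such vectors (in practice it might pull from your document store), and the toy 2-D data and the `init` centers are purely illustrative.

```python
import random


def nearest(centers, x):
    """Return the index of the center closest to x (squared Euclidean distance)."""
    best, best_dist = 0, float("inf")
    for j, c in enumerate(centers):
        dist = sum((a - b) ** 2 for a, b in zip(c, x))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def minibatch_kmeans(sample_batch, k, batch_size=100, iterations=100, init=None):
    """Mini-batch k-means; only one batch is held in memory at a time.

    sample_batch(n) is a hypothetical callable returning n random vectors.
    """
    centers = [list(c) for c in (init if init is not None else sample_batch(k))]
    counts = [0] * k  # number of points assigned to each center so far
    for _ in range(iterations):
        batch = sample_batch(batch_size)
        # Assign the whole batch against the current centers first ...
        assigned = [nearest(centers, x) for x in batch]
        # ... then move each center towards its points with step size 1/count,
        # which keeps each center at the running mean of the points assigned to it.
        for x, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * a for c, a in zip(centers[j], x)]
    return centers


# Toy data standing in for document vectors: two 2-D Gaussian blobs.
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(500)]

centers = minibatch_kmeans(lambda n: rng.sample(data, n), k=2,
                           init=[(1.0, 1.0), (9.0, 9.0)])
print(centers)
```

The per-center step size of 1/count is the same running-mean update used by one-at-a-time variants, so calling this with `batch_size=1` gives an online flavour of the algorithm. For real text you would likely want sparse vectors and a cosine-style distance, which this sketch does not attempt.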

But in the end, it's still useless old k-means. It's a good quantization approach, but it is not very robust to noise and cannot handle clusters of different sizes, non-convex shapes, or hierarchy (e.g. baseball inside sports). It's a signal processing technique, not a data organization technique.

So the practical impact of all these is 0. Yes, they can run k-means on insane data - but if you can't make sense of the result, why would you do so?

Upvotes: 0

knb

Reputation: 9393

According to Prof. J. Han, who is currently teaching the Cluster Analysis in Data Mining class on Coursera, the most common methods for clustering text data are:

  • a combination of k-means and agglomerative (bottom-up) clustering
  • topic modeling
  • co-clustering

But I can't tell how to apply these to your dataset. It's big - good luck.
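The list above does not say how k-means and agglomerative clustering are combined. One pattern that fits the description, offered here only as an assumption, is to over-cluster with k-means into many small clusters and then merge the resulting centroids bottom-up. The sketch below shows just the merge step, using size-weighted centroid linkage. `merge_centroids` and the toy centroids and sizes are hypothetical, not taken from the course.

```python
def merge_centroids(centroids, sizes, target):
    """Merge weighted centroids bottom-up until `target` groups remain.

    Each group is (member indices, centroid, size). The closest pair of
    centroids is merged at every step (centroid linkage).
    """
    groups = [([i], list(c), n) for i, (c, n) in enumerate(zip(centroids, sizes))]
    while len(groups) > target:
        # Find the pair of groups whose centroids are closest.
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = sum((p - q) ** 2 for p, q in zip(groups[a][1], groups[b][1]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        ids_a, cen_a, n_a = groups[a]
        ids_b, cen_b, n_b = groups[b]
        # The merged centroid is the size-weighted mean of the two centroids.
        merged = (ids_a + ids_b,
                  [(n_a * p + n_b * q) / (n_a + n_b) for p, q in zip(cen_a, cen_b)],
                  n_a + n_b)
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [merged]
    return groups


# Hypothetical k-means output: five small clusters, given as centroids and sizes.
centroids = [(0, 0), (1, 0), (10, 10), (11, 10), (10, 11)]
sizes = [100, 50, 80, 40, 40]

groups = merge_centroids(centroids, sizes, target=2)
for members, centroid, size in groups:
    print(sorted(members), centroid, size)
```

Because the merge works on a few thousand centroids rather than on all documents, its quadratic cost stays manageable, and recording the merge order would give a rough hierarchy over the k-means clusters.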

For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). He is the developer of the tm package for R, which does text mining via document-term matrices.

The thesis contains case studies (Ch. 8.1.4 and 9) on applying k-means and then a Support Vector Machine classifier to some documents (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.

The process contains lots of intermediate steps of manual inspection.

Upvotes: 1
