Reputation: 757
I am using TfidfVectorizer from sklearn for document clustering. I have 20 million texts for which I want to compute clusters, but calculating the TF-IDF matrix takes too much time and the system gets stuck.
Is there any technique to deal with this problem? Is there an alternative method in any Python module?
Upvotes: 0
Views: 296
Reputation: 77485
Start small.
First cluster only 100,000 documents. Only once that works (because it probably won't), think about scaling up. If you don't succeed in clustering the subset (and text clusters are usually pretty bad), you won't fare well on the large set.
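For example, a minimal sketch of that subset-first approach (the `texts` list and the parameter values here are just assumptions, not part of the question):

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

# texts is assumed to hold the full list of 20 million documents
sample = random.sample(texts, 100_000)  # work on a 100,000-document subset first

# cap the vocabulary so the TF-IDF matrix stays manageable
vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
X = vectorizer.fit_transform(sample)

# MiniBatchKMeans scales better than plain KMeans on sparse TF-IDF matrices
km = MiniBatchKMeans(n_clusters=20, random_state=42)
labels = km.fit_predict(X)
```

If the clusters on this subset already look poor, scaling up to the full corpus is unlikely to fix them.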
Upvotes: 0
Reputation: 5921
Well, a corpus of 20 million texts is very large, and without meticulous and comprehensive preprocessing or some good computing instances (i.e. a lot of memory and good CPUs), the TF-IDF calculation may take a lot of time.
What you can do:
Limit your text corpus to a few hundred thousand samples (say, 200,000 texts). Having many more texts might not introduce much more variance than a much smaller (but reasonable) dataset.
Preprocess your texts as much as you can. A basic approach would be: tokenize your texts, remove stop words, apply word stemming, and use n-grams carefully (a sketch of these steps follows after this list). Once you've done all these steps, check how much you've reduced the size of your vocabulary. It should be much smaller than the original one.
If your dataset is not too big after that reduction, these steps should help you compute the TF-IDF much faster.
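As an illustration, here is a rough sketch of those preprocessing ideas using NLTK for stemming and scikit-learn for vectorization; the `corpus_sample` name and the cut-off values are placeholders you would tune for your own data:

```python
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

stemmer = SnowballStemmer("english")

def stem_tokenizer(text):
    # lowercase, split on whitespace, drop stop words, stem the rest
    tokens = text.lower().split()
    return [stemmer.stem(tok) for tok in tokens if tok not in ENGLISH_STOP_WORDS]

vectorizer = TfidfVectorizer(
    tokenizer=stem_tokenizer,  # custom tokenizer: stop-word removal + stemming
    ngram_range=(1, 2),        # unigrams and bigrams only
    min_df=5,                  # drop very rare terms
    max_df=0.8,                # drop terms appearing in more than 80% of documents
    max_features=100_000,      # hard cap on the vocabulary size
)

# corpus_sample is assumed to hold the reduced corpus (e.g. ~200,000 texts)
X = vectorizer.fit_transform(corpus_sample)
print(len(vectorizer.vocabulary_))  # see how much the vocabulary shrank
```

The `min_df`, `max_df` and `max_features` limits are what keep both the vocabulary and the resulting sparse matrix small enough to cluster in reasonable time.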
Upvotes: 1