Reputation: 249
I have asked similar questions before, but I haven't found a solution that works for my case specifically. I have about a million documents, and let's say each document has around 20-30 words in it. I want to lemmatize, remove stopwords, and use the top 100,000 words to build a tf-idf matrix, and then do SVD on it. How can I do this in Python in a reasonable amount of time and without running into memory errors?
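Roughly, the pipeline I have in mind looks like the sketch below (NLTK for lemmatization and stopwords, scikit-learn for tf-idf and SVD; `corpus` and the component count are just placeholders), but I don't know whether it will hold up at a million documents:

```python
# Sketch of the intended pipeline; assumes nltk.download('stopwords')
# and nltk.download('wordnet') have been run.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(doc):
    # lemmatize and drop stopwords; each doc is only 20-30 words
    return " ".join(
        lemmatizer.lemmatize(tok)
        for tok in doc.lower().split()
        if tok not in stop_words
    )

# corpus: some iterable of ~1M raw document strings (placeholder name)
docs = (preprocess(d) for d in corpus)

# keep only the 100,000 most frequent terms; the result is a sparse matrix
vectorizer = TfidfVectorizer(max_features=100_000)
X = vectorizer.fit_transform(docs)

# TruncatedSVD works directly on the sparse tf-idf matrix
svd = TruncatedSVD(n_components=300)
X_reduced = svd.fit_transform(X)
```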
If someone has any idea that would be great.
Upvotes: 0
Views: 90
Reputation: 11
There is an algorithm called SPIMI (single-pass in-memory indexing). It basically involves going through your data and writing to disk every time you run out of memory; you then merge all your disk saves into one large matrix. I've implemented this for a project here
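The linked project isn't reproduced here, but very roughly the block-and-merge idea can be sketched with scikit-learn and scipy like this (the chunking helper `chunks_of_documents` and the feature count are assumptions; hashing is used instead of an explicit vocabulary so memory stays bounded):

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

# Hashing avoids keeping a growing vocabulary in memory;
# 2**17 = 131,072 columns roughly matches a 100k-term budget.
vectorizer = HashingVectorizer(n_features=2**17, alternate_sign=False)

block_files = []
# chunks_of_documents: your own batching over the corpus (placeholder name)
for i, chunk in enumerate(chunks_of_documents):
    block = vectorizer.transform(chunk)   # sparse block for this chunk
    fname = f"block_{i}.npz"
    sp.save_npz(fname, block)             # flush the block to disk
    block_files.append(fname)

# merge all on-disk blocks back into one large sparse matrix
X = sp.vstack([sp.load_npz(f) for f in block_files])
```

From there you can apply `TfidfTransformer` to the merged matrix and feed the result to `TruncatedSVD`; both accept sparse input.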
Upvotes: 1