Nilan Saha

Reputation: 249

How to perform LSA on a huge dataset that does not fit into memory with Python?

I have seen similar questions before, but I haven't found a solution that works for my case specifically. I have a million documents, and let's say each document has around 20-30 words in it. I want to lemmatize, remove stopwords, build a tf-idf matrix using a vocabulary of 100,000 words, and then do SVD on it. How can I do this in Python within a reasonable time and without running into memory errors?
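For reference, the plain in-memory version of the pipeline I have in mind looks roughly like this (just a sketch using scikit-learn; the corpus and parameters are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Placeholder corpus; the real one is ~1M documents of 20-30 words each.
# (Lemmatization, e.g. with spaCy or NLTK, would happen before this step.)
docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "a quick brown fox jumps over the lazy dog",
    "machine learning on text documents",
]

# Stopword removal + cap the vocabulary at 100,000 terms
vectorizer = TfidfVectorizer(stop_words="english", max_features=100_000)
tfidf = vectorizer.fit_transform(docs)   # sparse (n_docs x vocab) matrix

# LSA = truncated SVD on the tf-idf matrix
svd = TruncatedSVD(n_components=2)       # would be ~100-300 on the real data
lsa = svd.fit_transform(tfidf)
```

This works on toy data, but I'm not sure how to scale it to the full corpus.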

If someone has any idea that would be great.

Upvotes: 0

Views: 90

Answers (1)

JeanWolf

Reputation: 11

There is an algorithm called SPIMI (single-pass in-memory indexing). It basically involves going through your data in a single pass and writing a partial index to disk every time you run out of memory; you then merge all the on-disk blocks into one large index. I've implemented this for a project here.
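A rough sketch of the idea (not my actual project code; the block size and file layout are just illustrative):

```python
import os
import pickle
from collections import defaultdict

def spimi_invert(token_stream, block_dir, max_postings=1_000_000):
    """Single-pass in-memory indexing: accumulate a partial inverted index
    in memory and flush it to disk as a sorted block whenever it gets big
    (a stand-in for 'running out of memory')."""
    os.makedirs(block_dir, exist_ok=True)
    block_paths, index, n_postings = [], defaultdict(list), 0

    for doc_id, term in token_stream:         # (doc_id, term) pairs, one pass
        index[term].append(doc_id)
        n_postings += 1
        if n_postings >= max_postings:
            block_paths.append(write_block(index, block_dir, len(block_paths)))
            index, n_postings = defaultdict(list), 0

    if index:                                  # flush whatever is left
        block_paths.append(write_block(index, block_dir, len(block_paths)))
    return block_paths

def write_block(index, block_dir, block_no):
    """Write one partial index to disk with its terms sorted, so the blocks
    can later be combined with a k-way merge."""
    path = os.path.join(block_dir, f"block_{block_no}.pkl")
    with open(path, "wb") as f:
        pickle.dump(sorted(index.items()), f)
    return path

def merge_blocks(block_paths):
    """Merge the on-disk blocks into one index. Simplified: a real merge
    streams all blocks term-by-term instead of loading each one whole."""
    merged = defaultdict(list)
    for path in block_paths:
        with open(path, "rb") as f:
            for term, doc_ids in pickle.load(f):
                merged[term].extend(doc_ids)
    return merged
```

From the merged index (term to document postings) you can compute tf-idf weights as a sparse matrix and then run a truncated SVD on that, without ever holding the whole corpus in memory at once.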

Upvotes: 1
