Reputation: 6338
Suppose I have a (possibly) large corpus of about 2.5M documents, each with 500 features (after running LSI on the original data with gensim). I need the corpus to train my classifiers using scikit-learn, which means I first have to convert it into a numpy array. The corpus creation and the classifier training are done in two different scripts.
The problem is that my collection is expected to grow, and even at this stage I don't have enough memory (32GB on the machine) to convert everything at once with gensim.matutils.corpus2dense. To work around this I am converting one vector at a time, but it is very slow.
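By "converting one vector at a time" I mean something along these lines (a simplified sketch, not my exact code; corpus stands for my streamed LSI corpus):

```python
import numpy as np
from gensim import matutils

NUM_FEATURES = 500   # LSI dimensionality
NUM_DOCS = 2500000   # roughly the current collection size

# Densify each LSI vector separately and write it into a memory-mapped
# array on disk; this avoids one huge in-memory allocation but is very slow.
X = np.memmap('lsi_dense.dat', dtype=np.float32, mode='w+',
              shape=(NUM_DOCS, NUM_FEATURES))
for i, doc in enumerate(corpus):   # each doc is a list of (feature_id, value) pairs
    X[i] = matutils.sparse2full(doc, NUM_FEATURES)
```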
I have considered dumping the corpus to svmlight format and having scikit-learn load it with sklearn.datasets.load_svmlight_file. But wouldn't that mean loading everything into memory at once anyway?
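What I have in mind there is roughly this (again just a sketch; corpus and labels are placeholders for what my corpus-creation script produces):

```python
from gensim.corpora import SvmLightCorpus
from sklearn.datasets import load_svmlight_file

# In the corpus-creation script: stream the corpus to disk in svmlight format.
SvmLightCorpus.serialize('corpus.svmlight', corpus, labels=labels)

# In the training script: read it back for scikit-learn.
X, y = load_svmlight_file('corpus.svmlight', n_features=500)
```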
Is there any way to efficiently convert a gensim corpus to a numpy array (or a scipy sparse matrix)?
Upvotes: 4
Views: 2716
Reputation: 3316
I'm not very knowledgeable about Gensim, so I hesitate to answer, but here goes:
Your data does not fit in memory, so you will have to either stream it (basically what you are doing now) or chunk it. It looks to me like gensim.utils.chunkize chunks it for you, and you should be able to get the dense numpy arrays you need with as_numpy=True. You will have to use the sklearn models that support partial_fit; these are trained iteratively, one batch at a time. Good choices are the SGD classifier and the Passive-Aggressive classifier. Make sure to pass the classes argument at least the first time you call partial_fit. I recommend reading the docs on out-of-core scaling.
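Something along these lines might work (an untested sketch; corpus, labels and the 500 LSI features come from your question, the chunk size is arbitrary, and I densify each chunk with gensim.matutils.corpus2dense instead of relying on as_numpy):

```python
import itertools
import numpy as np
from gensim import matutils, utils
from sklearn.linear_model import SGDClassifier

NUM_FEATURES = 500              # LSI dimensionality from the question
CHUNK_SIZE = 10000              # pick something that fits comfortably in memory
ALL_CLASSES = np.array([0, 1])  # every class label you will ever see

clf = SGDClassifier()
label_iter = iter(labels)       # labels aligned with the streamed corpus

for chunk in utils.chunkize(corpus, chunksize=CHUNK_SIZE):
    # corpus2dense returns a (num_terms, num_docs) array, hence the transpose.
    X = matutils.corpus2dense(chunk, num_terms=NUM_FEATURES).T
    y = np.array(list(itertools.islice(label_iter, len(chunk))))
    # classes is required on the first call; repeating the same array is harmless.
    clf.partial_fit(X, y, classes=ALL_CLASSES)
```

The Passive-Aggressive classifier (sklearn.linear_model.PassiveAggressiveClassifier) exposes the same partial_fit interface, so it can be dropped into the same loop.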
Upvotes: 2