Reputation: 6338
Suppose I have a (possibly) large corpus of about 2.5M documents, each with 500 features (after running LSI on the original data with gensim). I need the corpus to train my classifiers using scikit-learn, which means I first have to convert it into a numpy array. The corpus creation and the classifier training are done in two different scripts.
The problem is that my collection is expected to grow, and even at this stage I don't have enough memory (32GB on the machine) to convert everything at once with gensim.matutils.corpus2dense. To work around this I am converting one vector at a time, but it is very slow.
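By "converting one vector at a time" I mean something along these lines (a simplified sketch, not my exact code; corpus stands for my streamed LSI corpus):

```python
import numpy as np
from gensim import matutils

NUM_FEATURES = 500   # LSI dimensionality
NUM_DOCS = 2500000   # roughly the current collection size

# Densify each LSI vector separately and write it into a memory-mapped
# array on disk; this avoids one huge in-memory allocation but is very slow.
X = np.memmap('lsi_dense.dat', dtype=np.float32, mode='w+',
              shape=(NUM_DOCS, NUM_FEATURES))
for i, doc in enumerate(corpus):   # each doc is a list of (feature_id, value) pairs
    X[i] = matutils.sparse2full(doc, NUM_FEATURES)
```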
I have considered dumping the corpus to svmlight format and having scikit-learn load it with sklearn.datasets.load_svmlight_file. But wouldn't that mean loading everything into memory at once anyway?
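What I have in mind there is roughly this (again just a sketch; corpus and labels are placeholders for what my corpus-creation script produces):

```python
from gensim.corpora import SvmLightCorpus
from sklearn.datasets import load_svmlight_file

# In the corpus-creation script: stream the corpus to disk in svmlight format.
SvmLightCorpus.serialize('corpus.svmlight', corpus, labels=labels)

# In the training script: read it back for scikit-learn.
X, y = load_svmlight_file('corpus.svmlight', n_features=500)
```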
Is there any way to efficiently convert a gensim corpus to a numpy array (or a scipy sparse matrix)?
Upvotes: 4
Views: 2716
Reputation: 3316
I'm not very knowledgeable about Gensim, so I hesitate to answer, but here goes:
Your data does not fit in memory, so you will have to either stream it (basically what you are doing now) or chunk it. It looks to me like gensim.utils.chunkize chunks it for you, and you should be able to get the dense numpy arrays you need with as_numpy=True. You will have to use the sklearn models that support partial_fit; these are trained iteratively, one batch at a time. Good choices are the SGD classifier and the Passive-Aggressive classifier. Make sure to pass the classes argument at least the first time you call partial_fit. I recommend reading the docs on out-of-core scaling.
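Something along these lines might work (an untested sketch; corpus, labels and the 500 LSI features come from your question, the chunk size is arbitrary, and I densify each chunk with gensim.matutils.corpus2dense instead of relying on as_numpy):

```python
import itertools
import numpy as np
from gensim import matutils, utils
from sklearn.linear_model import SGDClassifier

NUM_FEATURES = 500              # LSI dimensionality from the question
CHUNK_SIZE = 10000              # pick something that fits comfortably in memory
ALL_CLASSES = np.array([0, 1])  # every class label you will ever see

clf = SGDClassifier()
label_iter = iter(labels)       # labels aligned with the streamed corpus

for chunk in utils.chunkize(corpus, chunksize=CHUNK_SIZE):
    # corpus2dense returns a (num_terms, num_docs) array, hence the transpose.
    X = matutils.corpus2dense(chunk, num_terms=NUM_FEATURES).T
    y = np.array(list(itertools.islice(label_iter, len(chunk))))
    # classes is required on the first call; repeating the same array is harmless.
    clf.partial_fit(X, y, classes=ALL_CLASSES)
```

The Passive-Aggressive classifier (sklearn.linear_model.PassiveAggressiveClassifier) exposes the same partial_fit interface, so it can be dropped into the same loop.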
Upvotes: 2