Reputation: 33
I have documents totaling over 37M sentences, and I'm using Gensim's Doc2Vec to train on them. Training works fine with smaller datasets, say 5M-10M records, but when training on the full dataset the process usually dies at the "resetting layer weights" stage, and sometimes earlier.
I suspect it's a memory issue: I have 16GB of RAM and 4 cores. If it is indeed a memory issue, is there any way I can train the model in batches? From reading the documentation, it seems train() is only useful when the new documents don't introduce new vocabulary, but that's not the case with my documents.
Any suggestions?
Upvotes: 1
Views: 1092
Reputation: 54173
It's not the raw size of your corpus, per se, that makes the model larger, but the number of unique words/doc-tags you want the model to train.
If you're training on 37 million unique documents, each with its own ID as its doc-tag, and using a common vector-size like 300 dimensions, those doc-vectors alone will require:
37 million * 300 dimensions * 4 bytes/dimension = 44.4 GB
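As a quick sanity check, the same arithmetic in Python (assuming gensim's default float32 storage, 4 bytes per dimension):

    # Raw size of the doc-vector array alone: one float32 per dimension
    n_docs, vector_size, bytes_per_dim = 37_000_000, 300, 4
    print(n_docs * vector_size * bytes_per_dim / 1e9)  # -> 44.4 (GB)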
More RAM will be required for the unique words and internal model weights, but with a normal-size vocabulary and a reasonable choice of min_count to discard rarer words, not nearly as much as the doc-vectors.
Gensim supports streamed training that doesn't require more memory for a larger corpus, but if you want to end up with 37 million 300-dimensional vectors in the same model, that amount of addressable memory will still be required.
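A streamed corpus is just a re-iterable object that reads documents from disk on every pass, so only one document needs to be in RAM at a time. A minimal sketch, assuming a recent gensim and a one-document-per-line file (the file name and whitespace tokenization are placeholders for your own pipeline):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    class CorpusStream:
        """Re-reads the corpus from disk on each training pass."""
        def __init__(self, path):
            self.path = path
        def __iter__(self):
            with open(self.path, encoding='utf-8') as f:
                for i, line in enumerate(f):
                    # one document per line; its line number serves as its doc-tag
                    yield TaggedDocument(words=line.split(), tags=[i])

    corpus = CorpusStream('sentences.txt')
    model = Doc2Vec(vector_size=300, min_count=5, workers=4)
    model.build_vocab(corpus)  # one streaming pass to discover the vocabulary
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)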
Your best bet might be to train a model on some smaller, representative subset – perhaps just a random sample – that fits in addressable memory. Then, when you need vectors for the other docs, you could use infer_vector() to calculate them one at a time and store them somewhere else. (But you still wouldn't have them all in memory, which is crucial for adequately-fast most_similar() scans or other full-corpus comparisons.)
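A sketch of that subset-then-infer pattern, assuming a hypothetical read_corpus() helper that streams (doc_id, tokens) pairs from your data:

    import random
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    random.seed(42)

    # First pass: keep a ~10% random sample in memory for training
    # (the sampling rate is arbitrary; pick what fits your RAM).
    subset = [TaggedDocument(words=tokens, tags=[doc_id])
              for doc_id, tokens in read_corpus()
              if random.random() < 0.10]
    model = Doc2Vec(subset, vector_size=300, min_count=5, workers=4, epochs=20)

    # Second pass: infer a vector for every document, one at a time,
    # and persist them outside the model's RAM.
    with open('all_vectors.txt', 'w') as out:
        for doc_id, tokens in read_corpus():
            vec = model.infer_vector(tokens)
            out.write('%s %s\n' % (doc_id, ' '.join('%.6f' % x for x in vec)))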
Using a machine with tons of RAM makes working with such large vector-sets much easier.
(One other possible trick involves the mapfile_path parameter – but unless you're familiar with how your operating system handles memory-mapped files, and understand how the big doc-vectors array is further used/transformed in your later operations, it may be more trouble than it's worth. It also involves a performance hit, which will likely only be tolerable if each doc has a single unique ID tag, so that the pattern of access to the mmapped file – in both training and similarity searches – is a simple front-to-back read in the original order. You can see this answer for more details.)
Upvotes: 0