little JJ

Reputation: 71

How to load a large dataset into a gensim word2vec model

I have multiple text files (around 40), and each file holds around 2000 articles (averaging 500 words each). Each article is a single line in its text file.

Because of memory limitations, I want to load these text files dynamically during training. (Perhaps with an iterator class?)

How do I proceed?

Upvotes: 6

Views: 3020

Answers (1)

gojomo

Reputation: 54153

A corpus of 40 text files * 2000 articles * 500 words each equals about 40,000,000 words in total, which is still pretty small for this kind of work. I'd guess that's under 400MB, uncompressed, on disk. Even if 4x that in RAM, a lot of desktop or cloud machines could easily handle that 1-2GB of text as Python objects, as a list-of-lists-of-string-tokens. So, you may still have the freedom, depending on your system, to work all-in-memory.

But if you don't, that's OK, because the gensim Word2Vec & related classes can easily take all training data from any iterable sequence that provides each item in turn, and such iterables can in fact read text line-by-line from one, or many, files – each time the data is needed.

Most gensim intro Word2Vec tutorials will demonstrate this, with example code (or the use of library utilities) to read from one file, or many.

For example, gensim's included LineSentence class can be instantiated with the path to a single text file, where each line is one text/sentence, and single spaces separate each word. The resulting object is a Python iterable sequence, which can be iterated over to get those lists-of-words as many times as needed. (Behind the scenes, it's opening & stream-reading the file each time, so no more than the current text need ever be in RAM at a time.)
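For instance, a minimal sketch of that usage (the filename corpus.txt here is just a placeholder, not anything from the question):

from gensim.models.word2vec import LineSentence, Word2Vec

sentences = LineSentence('corpus.txt')  # streams the file line-by-line on each pass
model = Word2Vec(sentences)             # gensim iterates over the stream multiple times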

An early gensim Word2Vec tutorial – https://rare-technologies.com/word2vec-tutorial/ – shows a short MySentences Python class that does the same over all files in a single directory:

import os
import gensim

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # re-opens & streams every file in the directory on each pass
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()  # one list-of-words per line

sentences = MySentences('/some/directory')  # a memory-friendly iterable
model = gensim.models.Word2Vec(sentences)

For Word2Vec, it doesn't really matter if you provide the text sentence-by-sentence, or paragraph-by-paragraph, or article-by-article. It's the smaller windows of nearby words that drive the results, not the 'chunks' you choose to pass to the algorithm. So, do whatever is easiest. (But, avoid chunks of more than 10000 words at a time in gensim versions through the current gensim-3.8.3 release, as an internal limit will discard words past the 10000 mark for each text.)
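If any of your articles might run longer than that, one sketch of a workaround (not a gensim feature, just plain Python) is to split each long token list into pieces of at most 10000 words before yielding them:

def chunked(tokens, max_len=10000):
    # hypothetical helper: break one long list-of-words into pieces of <= max_len
    for i in range(0, len(tokens), max_len):
        yield tokens[i:i + max_len]

# inside MySentences.__iter__, instead of `yield line.split()`:
#     for chunk in chunked(line.split()):
#         yield chunk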

However, don't do all training on one batch yourself, then do all training on another batch, etc. Combining all the data into one iterable is best. Then, all examples are consulted for initial vocabulary-discovery, and all examples are trained together, over the automatic multiple training passes – which is best for model convergence. (You don't want all of the early training to be among one set of examples, then all the late training to a different set of examples, as that would imbalance the examples' relative influences, and prevent the model from considering the full variety of training data in each optimization pass.)
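If your texts live in several separate corpora, one way to present them as a single restartable iterable (Word2Vec makes multiple passes, so a one-shot generator such as itertools.chain isn't enough) is a small wrapper class along these lines. CombinedCorpus is an illustrative name, not a gensim class:

class CombinedCorpus(object):
    def __init__(self, *corpora):
        self.corpora = corpora

    def __iter__(self):
        # chains the corpora afresh on every pass, so gensim can re-iterate
        for corpus in self.corpora:
            for text in corpus:
                yield text

all_texts = CombinedCorpus(MySentences('/some/directory'), MySentences('/other/directory'))
model = gensim.models.Word2Vec(all_texts)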

Upvotes: 4
