Reputation: 11
This is the API I am using: https://radimrehurek.com/gensim/models/fasttext.html
How I create the model:
model = FastText(vector_size=300, max_vocab_size=100_000, window=5, min_count=20)
How I load the vocabulary:
iterator = MyIterator(lib_path, ds_path, verbose=True)
model.build_vocab(corpus_iterable=iterator)
My iterator:
class MyIterator:
    def __init__(self, lib_path, ds_path, verbose=False):
        self.lib_path = lib_path
        self.ds_path = ds_path
        self.verbose = verbose

    def __iter__(self):
        languages = [
            language for language in os.listdir(self.ds_path)
            if not language.startswith(".") and language != "markdown"
        ]
        n_languages = len(languages)
        for i, language in enumerate(languages):
            tokenizer = Tokenizer()
            for file in os.listdir(os.path.join(self.ds_path, language)):
                if not file.startswith("."):
                    with open(os.path.join(self.ds_path, language, file), "r", encoding="utf-8") as f:
                        text = f.read()
                        try:
                            if self.verbose:
                                print(f"Parsing - language {language} - {i + 1}/{n_languages} - file {file}")
                            tokens = tokenizer.tokenize(text)
                            yield tokens
                        except:
                            if self.verbose:
                                print(f"Error - couldn't parse {language}/{file}!")
                            pass
Error:
Traceback (most recent call last):
File "/home/x/dev/embed/train_embeddings.py", line 104, in <module>
main()
File "/home/x/dev/embed/train_embeddings.py", line 94, in main
model.build_vocab(corpus_iterable=iterator)
File "/home/x/anaconda3/lib/python3.9/site-packages/gensim/models/word2vec.py", line 488, in build_vocab
total_words, corpus_count = self.scan_vocab(
File "/home/x/anaconda3/lib/python3.9/site-packages/gensim/models/word2vec.py", line 583, in scan_vocab
total_words, corpus_count = self._scan_vocab(corpus_iterable, progress_per, trim_rule)
File "/home/x/anaconda3/lib/python3.9/site-packages/gensim/models/word2vec.py", line 567, in _scan_vocab
vocab[word] += 1
MemoryError
I have 64 GB of RAM, and my custom iterator iterates over files using with, which should close each file after its block ends, so I have no idea where the possible memory leak is.
Upvotes: 1
Views: 332
Reputation: 54173
All that your with does is ensure the file is closed after the read.
Your line text = f.read() is still reading each entire file into memory before doing any tokenization, so if any of your input files are large, you're still paying their full size in RAM, as opposed to an iterator that reads each file incrementally (line-by-line or range-by-range) and yields smaller lists of tokens.
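For example, here's a rough sketch of a line-by-line variant of your iterator (the class name is just illustrative, and it assumes your Tokenizer can handle one line at a time):

import os


class LineByLineIterator:
    """Sketch: like MyIterator, but yields tokens per line instead of per whole file."""

    def __init__(self, ds_path, verbose=False):
        self.ds_path = ds_path
        self.verbose = verbose

    def __iter__(self):
        languages = [
            language for language in os.listdir(self.ds_path)
            if not language.startswith(".") and language != "markdown"
        ]
        for language in languages:
            tokenizer = Tokenizer()  # same Tokenizer class as in your code
            lang_dir = os.path.join(self.ds_path, language)
            for file in os.listdir(lang_dir):
                if file.startswith("."):
                    continue
                with open(os.path.join(lang_dir, file), "r", encoding="utf-8") as f:
                    for line in f:  # incremental read: one line at a time, never the whole file
                        line = line.strip()
                        if not line:
                            continue
                        try:
                            yield tokenizer.tokenize(line)
                        except Exception:
                            if self.verbose:
                                print(f"Error - couldn't parse a line of {language}/{file}!")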
Also, it's possible your Tokenizer itself requires a lot of memory on large texts.
Further, note that training FastText (& similar models) requires multiple passes over the corpus: first one pass to discover the vocabulary, then epochs training passes. Thus, by tokenizing inside your iterator, you duplicate that effort on every pass. While this isn't the cause of your present error, which happens before even a single full pass completes, it may improve future performance and memory usage to iterate over your full data once and write the tokenized form back to disk as space-delimited plain text. Then all subsequent iterations can skip the Tokenizer overhead and use simple whitespace tokenization instead.
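For example, a one-time pre-tokenization pass might look like this sketch (it reuses your MyIterator; tokenized_corpus.txt is just an illustrative output path):

# One-time pass: tokenize everything once, then write each text as a single
# space-delimited line so later passes only need cheap whitespace splitting.
iterator = MyIterator(lib_path, ds_path, verbose=True)
with open("tokenized_corpus.txt", "w", encoding="utf-8") as out:
    for tokens in iterator:
        out.write(" ".join(tokens) + "\n")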
I'd recommend:
- Pre-tokenizing your corpus once and writing each text to disk as a single space-delimited line of plain text (as sketched above).
- Having your iterator read files incrementally (line-by-line) and yield smaller lists of tokens, instead of calling f.read() on whole files.
- Letting all later passes stream the pre-tokenized file with plain whitespace splitting (see the sketch below).
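Later runs can then let gensim stream that file itself via LineSentence, which yields one list of whitespace-split tokens per line (again just a sketch, assuming the tokenized_corpus.txt written above):

from gensim.models import FastText
from gensim.models.word2vec import LineSentence

corpus_path = "tokenized_corpus.txt"  # the hypothetical file written above

model = FastText(vector_size=300, max_vocab_size=100_000, window=5, min_count=20)

# LineSentence streams the file lazily: one list of whitespace-split tokens per line.
model.build_vocab(corpus_iterable=LineSentence(corpus_path))
model.train(
    corpus_iterable=LineSentence(corpus_path),
    total_examples=model.corpus_count,
    epochs=model.epochs,
)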
Upvotes: 2