random_display_name

Reputation: 11

Out of memory exception with FastText model

This is the API I am using: https://radimrehurek.com/gensim/models/fasttext.html

How I create the model:

from gensim.models import FastText

model = FastText(vector_size=300, max_vocab_size=100_000, window=5, min_count=20)

How I build the vocabulary:

iterator = MyIterator(lib_path, ds_path, verbose=True)
model.build_vocab(corpus_iterable=iterator)

My iterator:

import os

# Tokenizer is imported from the tokenization library the project already uses.


class MyIterator:
    def __init__(self, lib_path, ds_path, verbose=False):
        self.lib_path = lib_path
        self.ds_path = ds_path
        self.verbose = verbose

    def __iter__(self):
        # Skip hidden entries and the "markdown" directory.
        languages = [
            language for language in os.listdir(self.ds_path)
            if not language.startswith(".") and language != "markdown"
        ]
        n_languages = len(languages)
        for i, language in enumerate(languages):
            tokenizer = Tokenizer()

            for file in os.listdir(os.path.join(self.ds_path, language)):
                if not file.startswith("."):
                    # Read the whole file into memory, then tokenize it.
                    with open(os.path.join(self.ds_path, language, file), "r", encoding="utf-8") as f:
                        text = f.read()

                    try:
                        if self.verbose:
                            print(f"Parsing - language {language} - {i + 1}/{n_languages} - file {file}")
                        tokens = tokenizer.tokenize(text)

                        yield tokens
                    except Exception:
                        if self.verbose:
                            print(f"Error - couldn't parse {language}/{file}!")

Error:

Traceback (most recent call last):
  File "/home/x/dev/embed/train_embeddings.py", line 104, in <module>
    main()
  File "/home/x/dev/embed/train_embeddings.py", line 94, in main
    model.build_vocab(corpus_iterable=iterator)
  File "/home/x/anaconda3/lib/python3.9/site-packages/gensim/models/word2vec.py", line 488, in build_vocab
    total_words, corpus_count = self.scan_vocab(
  File "/home/x/anaconda3/lib/python3.9/site-packages/gensim/models/word2vec.py", line 583, in scan_vocab
    total_words, corpus_count = self._scan_vocab(corpus_iterable, progress_per, trim_rule)
  File "/home/x/anaconda3/lib/python3.9/site-packages/gensim/models/word2vec.py", line 567, in _scan_vocab
    vocab[word] += 1
MemoryError

I have 64 GB of RAM, and my custom iterator iterates over files using with, which should close each file once its block ends, so I have no idea where the memory leak could be.

Upvotes: 1

Views: 332

Answers (1)

gojomo

Reputation: 54173

All that your with does is ensure the file is closed after the read.

Your line text = f.read() still reads an entire file into memory before any tokenization happens, so if any of your input files are large, you're still paying their full size in RAM, as opposed to an iterator that reads each file incrementally (line-by-line or range-by-range) and yields smaller lists of tokens.
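
For illustration, here is a minimal line-by-line sketch of such an iterator. LineByLineIterator is a made-up name, and os and Tokenizer are assumed to be the same ones used in the question's code:

import os

class LineByLineIterator:
    # Sketch only: same directory layout as MyIterator above, but yields one
    # tokenized line at a time, so peak memory is bounded by the longest line
    # rather than by the largest file.
    def __init__(self, ds_path):
        self.ds_path = ds_path

    def __iter__(self):
        languages = [
            language for language in os.listdir(self.ds_path)
            if not language.startswith(".") and language != "markdown"
        ]
        for language in languages:
            tokenizer = Tokenizer()
            for file in os.listdir(os.path.join(self.ds_path, language)):
                if file.startswith("."):
                    continue
                path = os.path.join(self.ds_path, language, file)
                with open(path, "r", encoding="utf-8") as f:
                    for line in f:  # never read the whole file at once
                        line = line.strip()
                        if line:
                            yield tokenizer.tokenize(line)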

Also, it's possible your Tokenizer itself requires a lot of memory on large texts. Further, note that training FastText (and similar models) requires multiple passes over the corpus: first a pass to discover the vocabulary, then epochs more training passes. By tokenizing inside your iterator, you duplicate that tokenization effort on every pass. While this isn't the cause of your present error, which happens before even a single full pass completes, it may improve future performance and memory usage to iterate over your full data once and write the tokenized form back to disk as space-delimited plain text. All subsequent iterations can then skip the Tokenizer overhead and use simple whitespace tokenization instead.
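
A sketch of that one-time pre-tokenization pass, assuming iterator and model are the objects from the question and corpus_tokenized.txt is a hypothetical output filename:

# Sketch only: write the tokenized corpus once, as space-delimited plain text.
with open("corpus_tokenized.txt", "w", encoding="utf-8") as out:
    for tokens in iterator:
        out.write(" ".join(tokens) + "\n")

# Later passes can stream that file with gensim's LineSentence, which reads it
# line by line and splits on whitespace, so no Tokenizer is needed.
from gensim.models.word2vec import LineSentence

corpus = LineSentence("corpus_tokenized.txt")
model.build_vocab(corpus_iterable=corpus)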

I'd recommend:

  • read, and yield, your corpus in smaller chunks, such as line-by-line rather than file-by-file (unless the files are already very short)
  • do one tokenization pass that writes a simpler space-delimited corpus to a disk file, to avoid duplicating the tokenization work
  • if you keep having problems, add more progress reporting so you can tell: how many files completed before the error? which file was being worked on when the error occurred, and does its size or contents look exceptional?
  • double-check that your iterator is yielding properly-tokenized, reasonably-sized chunks of text containing only the expected vocabulary, not other unexpected tokens (see the sketch after this list)
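
For that last check, a quick sketch, assuming iterator is the MyIterator instance from the question:

# Sketch only: peek at the first few chunks the iterator yields before handing
# it to build_vocab, to confirm their size and contents look sane.
from itertools import islice

for i, tokens in enumerate(islice(iterator, 5)):
    print(f"chunk {i}: {len(tokens)} tokens, sample: {tokens[:10]}")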

Upvotes: 2
