Reputation: 11
This is the API I am using: https://radimrehurek.com/gensim/models/fasttext.html
How I create the model:
model = FastText(vector_size=300, max_vocab_size=100_000, window=5, min_count=20)
How I load the vocabulary:
iterator = MyIterator(lib_path, ds_path, verbose=True)
model.build_vocab(corpus_iterable=iterator)
My iterator:
class MyIterator:
    def __init__(self, lib_path, ds_path, verbose=False):
        self.lib_path = lib_path
        self.ds_path = ds_path
        self.verbose = verbose

    def __iter__(self):
        languages = [
            language for language in os.listdir(self.ds_path)
            if not language.startswith(".") and language != "markdown"
        ]
        n_languages = len(languages)
        for i, language in enumerate(languages):
            tokenizer = Tokenizer()
            for file in os.listdir(os.path.join(self.ds_path, language)):
                if not file.startswith("."):
                    with open(os.path.join(self.ds_path, language, file), "r", encoding="utf-8") as f:
                        text = f.read()
                        try:
                            if self.verbose:
                                print(f"Parsing - language {language} - {i + 1}/{n_languages} - file {file}")
                            tokens = tokenizer.tokenize(text)
                            yield tokens
                        except:
                            if self.verbose:
                                print(f"Error - couldn't parse {language}/{file}!")
                            pass
Error:
Traceback (most recent call last):
File "/home/x/dev/embed/train_embeddings.py", line 104, in <module>
main()
File "/home/x/dev/embed/train_embeddings.py", line 94, in main
model.build_vocab(corpus_iterable=iterator)
File "/home/x/anaconda3/lib/python3.9/site-packages/gensim/models/word2vec.py", line 488, in build_vocab
total_words, corpus_count = self.scan_vocab(
File "/home/x/anaconda3/lib/python3.9/site-packages/gensim/models/word2vec.py", line 583, in scan_vocab
total_words, corpus_count = self._scan_vocab(corpus_iterable, progress_per, trim_rule)
File "/home/x/anaconda3/lib/python3.9/site-packages/gensim/models/word2vec.py", line 567, in _scan_vocab
vocab[word] += 1
MemoryError
I have 64 GB of RAM, and my custom iterator iterates over files using with, which should close each file after its block ends, so I have no idea where the possible memory leak is.
Upvotes: 1
Views: 332
Reputation: 54173
All that your with does is ensure the file is closed after the read.
Your line text = f.read() is still reading each entire file into memory before doing any tokenization, so if any of your input files are large, you're still paying their full size in RAM, as opposed to an iterator that reads each file incrementally (line-by-line or range-by-range) and yields smaller lists of tokens.
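For example, here's a rough sketch of a line-by-line variant of your iterator (the class name is just illustrative, and it assumes your Tokenizer can handle one line at a time):

import os


class LineByLineIterator:
    """Sketch: like MyIterator, but yields tokens per line instead of per whole file."""

    def __init__(self, ds_path, verbose=False):
        self.ds_path = ds_path
        self.verbose = verbose

    def __iter__(self):
        languages = [
            language for language in os.listdir(self.ds_path)
            if not language.startswith(".") and language != "markdown"
        ]
        for language in languages:
            tokenizer = Tokenizer()  # same Tokenizer class as in your code
            lang_dir = os.path.join(self.ds_path, language)
            for file in os.listdir(lang_dir):
                if file.startswith("."):
                    continue
                with open(os.path.join(lang_dir, file), "r", encoding="utf-8") as f:
                    for line in f:  # incremental read: one line at a time, never the whole file
                        line = line.strip()
                        if not line:
                            continue
                        try:
                            yield tokenizer.tokenize(line)
                        except Exception:
                            if self.verbose:
                                print(f"Error - couldn't parse a line of {language}/{file}!")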
Also, it's possible your Tokenizer itself requires a lot of memory on large texts.
Further, note that training FastText (& similar models) requires multiple passes over the corpus: first one pass to discover the vocabulary, then epochs training passes. Thus, by tokenizing inside your iterator, you duplicate that effort on every pass. While this isn't the cause of your present error, which happens before even a single full pass completes, it may improve future performance and memory usage to iterate over your full data once and write the tokenized form back to disk as space-delimited plain text. Then all subsequent iterations can skip the Tokenizer overhead and use simple whitespace tokenization instead.
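For example, a one-time pre-tokenization pass might look like this sketch (it reuses your MyIterator; tokenized_corpus.txt is just an illustrative output path):

# One-time pass: tokenize everything once, then write each text as a single
# space-delimited line so later passes only need cheap whitespace splitting.
iterator = MyIterator(lib_path, ds_path, verbose=True)
with open("tokenized_corpus.txt", "w", encoding="utf-8") as out:
    for tokens in iterator:
        out.write(" ".join(tokens) + "\n")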
I'd recommend:
- Pre-tokenizing your corpus once and writing each text to disk as a single space-delimited line of plain text (as sketched above).
- Having your iterator read files incrementally (line-by-line) and yield smaller lists of tokens, instead of calling f.read() on whole files.
- Letting all later passes stream the pre-tokenized file with plain whitespace splitting (see the sketch below).
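Later runs can then let gensim stream that file itself via LineSentence, which yields one list of whitespace-split tokens per line (again just a sketch, assuming the tokenized_corpus.txt written above):

from gensim.models import FastText
from gensim.models.word2vec import LineSentence

corpus_path = "tokenized_corpus.txt"  # the hypothetical file written above

model = FastText(vector_size=300, max_vocab_size=100_000, window=5, min_count=20)

# LineSentence streams the file lazily: one list of whitespace-split tokens per line.
model.build_vocab(corpus_iterable=LineSentence(corpus_path))
model.train(
    corpus_iterable=LineSentence(corpus_path),
    total_examples=model.corpus_count,
    epochs=model.epochs,
)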
Upvotes: 2