total_words must be provided alongside corpus_file argument

Question

I am training Doc2Vec from a corpus file, which is very large.

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(dm=1, vector_size=200, workers=cores, comment='d2v_model_unigram_dbow_200_v1.0')
model.build_vocab(corpus_file=path)
model.train(corpus_file=path, total_examples=model.corpus_count, epochs=model.epochs)

I want to know how to get the value of total_words.

Edit:

total_words=model.corpus_total_words

Is this right?

gojomo · Accepted Answer

According to the current (gensim 3.8.1, October 2019) Doc2Vec.train() documentation, you shouldn't need to supply both total_examples and total_words, only one or the other:

To support linear learning-rate decay from (initial) alpha to min_alpha, and accurate progress-percentage logging, either total_examples (count of documents) or total_words (count of raw words in documents) MUST be provided. If documents is the same corpus that was provided to build_vocab() earlier, you can simply use total_examples=self.corpus_count.

However, it turns out that the new corpus_file option does require both, and the doc-comment is wrong. I've filed a bug to fix this documentation oversight.

Yes, the model caches the number of words observed during the most-recent build_vocab() inside model.corpus_total_words, so total_words=model.corpus_total_words should do the right thing for you.
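Putting that together, here is a minimal sketch of the corpus_file workflow with both totals passed to train(). The helper name train_from_corpus_file and the placeholder path in the usage comment are my own, not part of gensim; it assumes a gensim 3.8.x Doc2Vec model and a file with one space-delimited document per line.

```python
def train_from_corpus_file(model, corpus_path, epochs=None):
    """Build the vocabulary and train a Doc2Vec model in corpus_file mode.

    Hypothetical helper: assumes `model` is a gensim 3.8.x Doc2Vec instance
    and `corpus_path` is a text file with one space-delimited document per line.
    """
    # build_vocab() scans the file once and caches both counts on the model.
    model.build_vocab(corpus_file=corpus_path)

    # In corpus_file mode, train() needs both totals, not just one of them.
    model.train(
        corpus_file=corpus_path,
        total_examples=model.corpus_count,
        total_words=model.corpus_total_words,
        epochs=model.epochs if epochs is None else epochs,
    )
    return model


# Usage (requires gensim; the path is a placeholder):
#   from gensim.models.doc2vec import Doc2Vec
#   model = Doc2Vec(dm=1, vector_size=200, workers=4)
#   train_from_corpus_file(model, "corpus.txt")
```

Both counts come from the same build_vocab() pass, so they stay consistent with each other as long as the file isn't changed between the two calls.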

When using the corpus_file option (space-delimited text input), the numbers given by corpus_count and corpus_total_words should match the line and word counts you'd see by running wc your_file_path at a command line.
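If wc isn't handy, a rough equivalent in plain Python can serve as the same sanity check. This is only a sketch: count_lines_and_words is a hypothetical helper, it assumes a UTF-8 file, and it counts a final line without a trailing newline, which wc -l would not.

```python
def count_lines_and_words(corpus_path):
    """Return (line_count, word_count) for a whitespace-delimited text file.

    Hypothetical helper for comparing against model.corpus_count and
    model.corpus_total_words; assumes the file is UTF-8 encoded.
    """
    lines = 0
    words = 0
    with open(corpus_path, encoding="utf-8") as handle:
        # Stream line by line so a very large corpus never sits in memory.
        for line in handle:
            lines += 1
            words += len(line.split())  # split() with no args splits on any whitespace
    return lines, words
```

After build_vocab(corpus_file=path), you would expect count_lines_and_words(path) to agree with (model.corpus_count, model.corpus_total_words); a mismatch would suggest the file isn't in the format the corpus_file option expects.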

(If you were using the classic, plain Python iterable corpus option, which can't use threads as effectively, there would be no benefit to supplying both total_examples and total_words to train(): it would use only one or the other for estimating progress.)
