Reputation: 4301
I am training Doc2Vec with a corpus file, which is very large.
model = Doc2Vec(dm=1, vector_size=200, workers=cores, comment='d2v_model_unigram_dbow_200_v1.0')
model.build_vocab(corpus_file=path)
model.train(corpus_file=path, total_examples=model.corpus_count, epochs=model.iter)
I want to know how to get the value of total_words.
Edit:
total_words=model.corpus_total_words
Is this right?
Upvotes: 0
Views: 989
Reputation: 54223
According to the current (gensim 3.8.1, October 2019) Doc2Vec.train() documentation, you shouldn't need to supply both total_examples and total_words, only one or the other:
To support linear learning-rate decay from (initial) alpha to min_alpha, and accurate progress-percentage logging, either total_examples (count of documents) or total_words (count of raw words in documents) MUST be provided. If documents is the same corpus that was provided to build_vocab() earlier, you can simply use total_examples=self.corpus_count.
But it turns out the new corpus_file option does require both, and the doc-comment is wrong. I've filed a bug to fix this documentation oversight.
Yes, the model caches the number of words observed during the most recent build_vocab() inside model.corpus_total_words, so total_words=model.corpus_total_words should do the right thing for you.
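Putting that together, the full training call might look like the sketch below. It assumes gensim 3.8 or later. The helper name train_doc2vec_from_file and its defaults are hypothetical, not part of gensim, and I've used model.epochs where the question used model.iter, since as far as I know iter is only a deprecated alias in the 3.x series.

```python
import multiprocessing


def train_doc2vec_from_file(path, vector_size=200, workers=None):
    """Train Doc2Vec in corpus_file mode (hypothetical helper, not a gensim API)."""
    # Imported inside the function so the helper can be defined
    # even on a machine where gensim is not installed yet.
    from gensim.models.doc2vec import Doc2Vec

    if workers is None:
        workers = multiprocessing.cpu_count()

    model = Doc2Vec(dm=1, vector_size=vector_size, workers=workers)

    # build_vocab() scans the file once and caches both
    # model.corpus_count and model.corpus_total_words.
    model.build_vocab(corpus_file=path)

    # corpus_file mode needs both totals, so pass the cached values back in.
    model.train(
        corpus_file=path,
        total_examples=model.corpus_count,
        total_words=model.corpus_total_words,
        epochs=model.epochs,
    )
    return model
```

Because both totals come from the same build_vocab() pass, there is no need to count the file yourself before training.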
When using the corpus_file space-delimited text input option, the numbers given by corpus_count and corpus_total_words should match the line and word counts you'd also see by running wc your_file_path at a command line.
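If wc isn't available, a rough pure-Python equivalent of that sanity check could look like this. The function name count_lines_and_words is my own, and the sketch assumes a UTF-8 file with one document per line and whitespace-separated tokens. One known difference: wc -l counts newline characters, so a last line without a trailing newline is counted here but not by wc.

```python
def count_lines_and_words(path, encoding="utf-8"):
    """Return (line_count, word_count) for a whitespace-delimited text file."""
    lines = 0
    words = 0
    with open(path, encoding=encoding) as handle:
        # Iterate lazily so a very large corpus never has to fit in memory.
        for line in handle:
            lines += 1
            # split() with no arguments splits on any run of whitespace.
            words += len(line.split())
    return lines, words
```

After build_vocab(), you could compare the returned pair against (model.corpus_count, model.corpus_total_words); a mismatch would suggest the file isn't in the format the corpus_file option expects.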
(If you were using the classic, plain Python iterable corpus option (which can't use threads as effectively), then there would be no benefit to supplying both total_examples and total_words to train(); it would only use one or the other for estimating progress.)
Upvotes: 1