xx liu

Reputation: 34

What does 'corpus_count' mean in gensim word2vec?

I want to train word embeddings from scratch, using gensim.models.word2vec as my model. My corpus is too large to read at once, so I split the corpus file into many parts and train the model on them iteratively. I found this method helpful:

train(corpus_iterable=None, corpus_file=None, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, compute_loss=False, callbacks=(), **kwargs)

I am confused about the parameter "total_words". Does it mean the total number of words in my whole corpus, or only in the part currently being trained?

UPDATE:

My code looks like this:

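import gensim

model = gensim.models.word2vec.Word2Vec.load(init_model)
for i in range(parts):
    # this_part_file_name: path to the i-th part of the corpus
    model.build_vocab(corpus_file=this_part_file_name, update=True)
    model.train(corpus_file=this_part_file_name,
                total_words=word_count(this_part_file_name),
                epochs=model.epochs)  # train() requires an explicit epochs value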

Should the parameter total_words be word_count(this_part_file_name) or word_count(ALL_my_corpus_file)?

Upvotes: 0

Views: 1455

Answers (1)

Zeitgeist

Reputation: 1520

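total_words is the count of all raw words in the sentences of the corpus you pass to that train() call. It is used for learning-rate decay and progress reporting during that call, so when you train part by part it should be the word count of the current part, not of the whole corpus. You only have to provide one of the two: total_examples or total_words. If you ran build_vocab() on the same corpus, you can read the value from model.corpus_total_words. There is another count, word_count, which refers to the number of words that have already been trained. It defaults to 0, which means training covers all the words, so setting it is optional.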
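
As a rough sketch of how this could look when training in parts: the helper names count_raw_words and train_in_parts below are made up for illustration and are not part of gensim. The sketch assumes each part is a plain-text file with one sentence per line and whitespace-separated tokens, and that the model already has a vocabulary (for example, one restored with Word2Vec.load), since build_vocab(update=True) needs an existing one. It passes both cached counts from the most recent build_vocab() call, which should correspond to the current part only.

```python
def count_raw_words(path):
    """Count whitespace-separated tokens in a one-sentence-per-line text file.

    Hypothetical helper: a stand-in for the asker's word_count() function.
    """
    with open(path, encoding="utf-8") as handle:
        return sum(len(line.split()) for line in handle)


def train_in_parts(model, part_paths, epochs=None):
    """Update the vocabulary and train on each corpus part in turn.

    Assumes `model` is a gensim Word2Vec instance that already has a
    vocabulary. Hypothetical helper, not a gensim function.
    """
    for path in part_paths:
        path = str(path)
        # Scanning this part refreshes model.corpus_count (sentences) and
        # model.corpus_total_words (raw words) for this part only.
        model.build_vocab(corpus_file=path, update=True)
        model.train(
            corpus_file=path,
            total_examples=model.corpus_count,
            total_words=model.corpus_total_words,
            # train() needs an explicit epochs value; fall back to the model's own.
            epochs=epochs if epochs is not None else model.epochs,
        )
    return model
```

If you would rather count the words yourself, count_raw_words(path) could be passed as total_words instead of the cached attribute. On a file that follows the format assumed above, the two numbers should normally agree.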

More info: https://radimrehurek.com/gensim/models/word2vec.html

Upvotes: 1
