Reputation: 34
I want to train my word embeddings from scratch, using gensim.models.word2vec as the model. My corpus is too large to read at once, so I split the corpus file into many parts and train the model iteratively. I found this signature helpful:
train(corpus_iterable=None, corpus_file=None, total_examples=None, total_words=None, epochs=None, start_alpha=None, end_alpha=None, word_count=0, queue_factor=2, report_delay=1.0, compute_loss=False, callbacks=(), **kwargs)
UPDATE:
my code is like this:
import gensim

model = gensim.models.word2vec.Word2Vec.load(init_model)
for i in range(parts):
    # add this part's words to the existing vocabulary
    model.build_vocab(corpus_file=this_part_file_name, update=True)
    model.train(corpus_file=this_part_file_name,
                total_words=word_count(this_part_file_name),
                epochs=model.epochs)  # train() requires an explicit epochs count
Should the parameter total_words be word_count(this_part_file_name) or word_count(ALL_my_corpus_file)?
Upvotes: 0
Views: 1455
Reputation: 1520
total_words is the count of all raw words in the sentences of the corpus. You only have to provide one of the two: total_examples or total_words. If you ran build_vocab(), you can get the value for total_words from model.corpus_total_words.
There is another parameter, word_count, which is the count of words that have already been trained. You can leave it at its default of 0 to train on all the words; it is optional.
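To connect this back to your loop: each train() call consumes a single part, so total_words should describe that part, and model.corpus_total_words gives you exactly that after the build_vocab(..., update=True) scan of the same file. A minimal sketch under that reading (part_files is a hypothetical list of your part-file paths; epochs=model.epochs assumes gensim 4.x, where train() requires an explicit epochs count):

import gensim

model = gensim.models.word2vec.Word2Vec.load(init_model)
for part in part_files:
    # scan this part into the existing vocabulary
    model.build_vocab(corpus_file=part, update=True)
    # corpus_total_words was just set from the scan of this part
    model.train(corpus_file=part,
                total_words=model.corpus_total_words,
                epochs=model.epochs)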
More info: https://radimrehurek.com/gensim/models/word2vec.html
Upvotes: 1