Reputation: 37
This is the code for my model using Gensim. I ran it and it returned a tuple. Which of the two values is the number of tokens?
model = gensim.models.Word2Vec(mylist5,size=100, sg=0, window=5, alpha=0.05, min_count=5, workers=12, iter=20, cbow_mean=1, hs=0, negative=15)
model.train(mylist5, total_examples=len(mylist5), epochs=10)
The value returned by the model.train() call is shown below. What do these numbers mean?
(167131589, 208757070)
Which one is the number of tokens?
Upvotes: 0
Views: 916
Reputation: 54223
Since you already passed in your mylist5 corpus when you instantiated the model, it will have automatically done all steps to train the model with that data.
(You don't need to, and almost certainly should not, be calling .train() again. Typically .train() should only be called if you didn't provide any corpus at instantiation. And in such a case, you'd then call both .build_vocab() and .train().)
As noted by other answerers, the numbers reported by .train() are two tallies of the total tokens seen by the training process. (Most users won't actually need this info.)
If you want to know the number of unique tokens for which the model learned word-vectors, len(model.wv) is one way. (Before Gensim 4.0, len(model.wv.vocab) would have worked.)
Upvotes: 3
Reputation: 965
The Gensim source on GitHub (line 573) shows that model.train returns two values: trained_word_count and raw_word_count.
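"raw_word_count" is the total number of raw words fed to training, counted across all epochs.
"trained_word_count" is the number of those raw words left after ignoring unknown words and trimming the sentence length.
Given that order, in your tuple (167131589, 208757070) the first number is trained_word_count and the second is raw_word_count.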
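The relationship between the two tallies can be sketched in plain Python. This is a simplified illustration, not Gensim's actual implementation: the function name tally_words and its parameters are hypothetical, the order of filtering and trimming is an assumption, and it ignores the frequent-word downsampling that Gensim also applies.

```python
from collections import Counter


def tally_words(corpus, min_count=1, max_sentence_len=10000, epochs=1):
    """Return (kept, raw) word counts for a list of tokenized sentences.

    raw  : every token in every sentence, summed over all epochs.
    kept : tokens that survive trimming each sentence to max_sentence_len
           and dropping words rarer than min_count ("unknown" words).
    """
    counts = Counter(token for sentence in corpus for token in sentence)
    vocab = {token for token, count in counts.items() if count >= min_count}

    raw = 0
    kept = 0
    for _ in range(epochs):
        for sentence in corpus:
            raw += len(sentence)
            trimmed = sentence[:max_sentence_len]
            kept += sum(1 for token in trimmed if token in vocab)
    return kept, raw


# Hypothetical toy corpus for illustration.
toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["a", "cat", "and", "a", "dog"],
]

kept, raw = tally_words(toy_corpus, min_count=2, epochs=2)
print(kept, raw)  # prints: 28 34
```

With min_count=2 the words "mat", "log" and "and" fall out of the vocabulary, so 14 of the 17 tokens are kept per epoch, which mirrors why the first number in the returned tuple is smaller than the second.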
Upvotes: 1