How to find number of tokens in gensim model

Question

This is the code for my model using Gensim. I ran it and it returned a tuple. Which of the two values is the number of tokens?

model = gensim.models.Word2Vec(mylist5,size=100, sg=0, window=5, alpha=0.05, min_count=5, workers=12, iter=20, cbow_mean=1, hs=0, negative=15)

model.train(mylist5, total_examples=len(mylist5), epochs=10)

The value returned by the model.train() call is shown below. What does it represent?

 (167131589, 208757070)

Which of these numbers, if either, is the number of tokens?

gojomo · Accepted Answer

Since you already passed in your mylist5 corpus when you instantiated the model, it will have automatically done all the steps needed to train the model with that data.

(You don't need to, and almost certainly should not, call .train() again. Typically .train() should only be called if you didn't provide any corpus at instantiation. And in such a case, you'd then call both .build_vocab() and .train().)

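As noted by other answerers, the numbers reported by .train() are two tallies of the total tokens seen by the training process across all training passes. Per the Gensim documentation, the first is the effective word count (after ignoring unknown words and trimming overlong sentences) and the second is the raw word count. Neither is the vocabulary size. (Most users won't actually need this info.)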

If you want to know the number of unique tokens for which the model learned word-vectors, len(model.wv) is one way. (Before Gensim 4.0, len(model.wv.vocab) would have worked.)
