Reputation: 156
How can I retrieve the n most frequent words from a Gensim word2vec model? As I understand it, frequency and count are not the same, so I can't use the object.count() method.
I need to produce a list of the n most frequent words from my word2vec model.
Edit:
I've tried the following:
w2c = dict()
for item in model.wv.vocab:
    w2c[item] = model.wv.vocab[item].count
w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
w2cSortedList = list(w2cSorted.keys())
My initial guess was to use the code above, but it relies on the count attribute. I'm not sure whether that gives the most frequent words.
Upvotes: 8
Views: 7963
Reputation: 54173
The .count property of each vocab entry is the number of times that word was seen during the initial vocabulary survey. So sorting by that, and taking the highest-count words, will give you the most frequent words.
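A minimal sketch of that sort-by-count approach, assuming Gensim 4.x, where per-word counts are read with model.wv.get_vecattr(word, "count") rather than through model.wv.vocab. The function name most_frequent_by_count is hypothetical, and the model is assumed to have been trained (or had its vocabulary built) from a corpus, so that counts are actually present.

```python
def most_frequent_by_count(model, n):
    """Return the n highest-count (word, count) pairs from a Word2Vec-like model.

    Assumes Gensim 4.x: model.wv.index_to_key lists the known words and
    model.wv.get_vecattr(word, "count") returns the surveyed count.
    """
    wv = model.wv
    # Look up each word's count once, then sort from highest to lowest.
    counted = [(word, wv.get_vecattr(word, "count")) for word in wv.index_to_key]
    counted.sort(key=lambda pair: pair[1], reverse=True)
    return counted[:n]
```

Calling most_frequent_by_count(model, 100) would then give the 100 highest-count words along with their counts.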
But also, for efficiency, it's typical practice for the ordered list of known words to be sorted from most to least frequent. You can see this in the list model.wv.index_to_key, so you can retrieve the 100 most frequent words with model.wv.index_to_key[:100]. (In Gensim before version 4.0, this same list was called either index2entity or index2word.)
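As a rough sketch of the slicing approach, the helper below reads the frequency-ordered word list under either name. The function name top_n_words is hypothetical, and it assumes the vocabulary was left in its usual most-to-least-frequent order when the model was built.

```python
def top_n_words(model, n):
    """Return the first n words of the model's frequency-ordered vocabulary list."""
    wv = model.wv
    # Gensim >= 4.0 exposes index_to_key; earlier releases used index2word.
    keys = getattr(wv, "index_to_key", None)
    if keys is None:
        keys = wv.index2word
    return list(keys[:n])
```

For example, top_n_words(model, 100) mirrors model.wv.index_to_key[:100] on Gensim 4.x.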
Upvotes: 15