Phils19

Reputation: 156

Gensim (word2vec) retrieve n most frequent words

How can I retrieve the n most frequent words from a Gensim word2vec model? As I understand it, frequency and count are not the same thing, so I can't use the object.count() method.

I need to produce a list of the n most frequent words from my word2vec model.

Edit:

I've tried the following:

# Gensim < 4.0: model.wv.vocab maps each word to an entry with a .count attribute
w2c = {word: entry.count for word, entry in model.wv.vocab.items()}
# Words ordered from highest to lowest count
w2cSortedList = sorted(w2c, key=w2c.get, reverse=True)

My initial guess was the code above, but it relies on the count attribute, and I'm not sure whether that represents the most frequent words.

Upvotes: 8

Views: 7963

Answers (1)

gojomo

Reputation: 54173

The .count property of each vocab entry is the number of times that word was seen during the initial vocabulary survey. So sorting by it and taking the highest-count words will give you the most frequent words.
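A minimal sketch of that count-based approach, assuming Gensim 4.x, where per-word counts are read with get_vecattr(word, "count") instead of from .vocab entries. The helper name top_n_by_count is hypothetical, not part of Gensim:

```python
import heapq


def top_n_by_count(wv, n):
    """Return the n highest-count (word, count) pairs, largest first.

    Assumption: `wv` behaves like a Gensim 4.x KeyedVectors object, i.e. it
    exposes `index_to_key` and `get_vecattr(key, "count")`.
    """
    # Pair every known word with the count recorded during the vocabulary survey.
    pairs = ((word, wv.get_vecattr(word, "count")) for word in wv.index_to_key)
    # nlargest avoids sorting the whole vocabulary when n is small.
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])
```

With a trained model this would be called as top_n_by_count(model.wv, 100).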

Also, for efficiency, the list of known words is typically ordered from most to least frequent. You can view it at model.wv.index_to_key, so you can retrieve the 100 most frequent words with model.wv.index_to_key[:100]. (In Gensim before version 4.0, this same list was called either index2entity or index2word.)
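As a rough sketch of the slicing approach, again assuming Gensim 4.x attribute names. Both helper names are hypothetical, and the second one is only a sanity check for models where the vocabulary may not have been frequency-sorted (for example, if sorted_vocab was disabled at training time):

```python
def most_frequent_words(wv, n):
    """Return the first n known words, relying on frequency-sorted ordering."""
    # Assumes Gensim 4.x naming; before 4.0 this list was index2word / index2entity.
    return list(wv.index_to_key[:n])


def is_frequency_sorted(wv):
    """Return True if counts never increase along wv.index_to_key."""
    counts = [wv.get_vecattr(word, "count") for word in wv.index_to_key]
    return all(a >= b for a, b in zip(counts, counts[1:]))
```

Usage would look like most_frequent_words(model.wv, 100), optionally after confirming is_frequency_sorted(model.wv).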

Upvotes: 15
