neurix

Reputation: 4316

Reduce Google's Word2Vec model with Gensim

Loading the complete pre-trained word2vec model from Google is time-consuming and tedious, so I was wondering whether it is possible to remove words below a certain frequency to bring the vocabulary count down to e.g. 200k words.

I found methods in gensim's Word2Vec class to determine word frequency and to re-save the model, but I am not sure how to pop/remove vocabulary entries from the pre-trained model before saving it again. I couldn't find any hint of such an operation in the KeyedVectors or Word2Vec classes:

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py

How can I select a subset of the vocabulary of the pre-trained word2vec model?

Upvotes: 9

Views: 4118

Answers (2)

gojomo

Reputation: 54153

The GoogleNews word-vectors file format doesn't include frequency info. But it does seem to be sorted in roughly most-frequent to least-frequent order.

And load_word2vec_format() offers an optional limit parameter that reads only that many vectors from the given file.

So, the following should do roughly what you've requested:

from gensim.models import KeyedVectors

goognews_wordvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)
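If reload time is your main concern, one small follow-up (my sketch, not part of the original suggestion) is to write the 200k-word subset back out once and load only that smaller file in later runs; save_word2vec_format() is a standard KeyedVectors method, and the output file name below is just a placeholder:

# Assuming goognews_wordvecs was loaded with limit=200000 as above,
# write the reduced set out once (output file name is a placeholder):
goognews_wordvecs.save_word2vec_format('GoogleNews-200k.bin', binary=True)

# Later runs only need to load the much smaller file:
goognews_wordvecs = KeyedVectors.load_word2vec_format('GoogleNews-200k.bin', binary=True)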

Upvotes: 8

Luke Barker

Reputation: 915

Do you know about this open list of pretrained models? Maybe an alternative one would be a better fit than the jumbo Google one. :)

https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models

I don't know how to do exactly what you need, but there is some discussion on the Google group about trimming models that might be of use: https://groups.google.com/forum/#!topic/gensim/wkVhcuyj0Sg

They also reference a recent change for minimising the model, but I know that is not exactly what you want.

https://github.com/RaRe-Technologies/gensim/pull/987
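If you end up needing an arbitrary subset of the vocabulary rather than just the most frequent words, here is a rough sketch of one way to do it (my own, not taken from the linked discussion): load the full model once, write only the words you want to keep back out in the plain word2vec text format, and reload that smaller file afterwards. It assumes the pre-4.0 gensim attribute names index2word and vector_size (newer versions use index_to_key), and keep_words, the islower() filter, and the output file name are just placeholders for your own selection.

from gensim.models import KeyedVectors

# Load the full model once (slow, but only needed a single time).
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

# Choose which words to keep; this filter is only an example.
keep_words = [w for w in kv.index2word if w.islower()]

# Write the kept words in the plain word2vec text format:
# a header line "count dim", then one "word v1 v2 ..." line per word.
with open('trimmed-vectors.txt', 'w', encoding='utf8') as out:
    out.write('%d %d\n' % (len(keep_words), kv.vector_size))
    for word in keep_words:
        vec = kv[word]
        out.write('%s %s\n' % (word, ' '.join('%f' % x for x in vec)))

# Later runs can load just the trimmed file.
trimmed = KeyedVectors.load_word2vec_format('trimmed-vectors.txt', binary=False)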

Upvotes: 4
