Reputation: 4316
Loading the complete pre-trained word2vec model by Google is time-intensive and tedious, so I was wondering whether there is a way to remove words below a certain frequency to bring the vocabulary count down to, e.g., 200k words.
I found Word2Vec methods in the gensim package to determine the word frequency and to re-save the model, but I am not sure how to pop/remove vocabulary entries from the pre-trained model before saving it again. I couldn't find any hint of such an operation in the KeyedVectors class or the Word2Vec class:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py
How can I select a subset of the vocabulary of the pre-trained word2vec model?
Upvotes: 9
Views: 4118
Reputation: 54153
The GoogleNews word-vectors file format doesn't include frequency info. But, it does seem to be sorted in roughly more-frequent to less-frequent order.
And, load_word2vec_format() offers an optional limit parameter that only reads that many vectors from the given file.
So, the following should do roughly what you've requested:
from gensim.models import KeyedVectors

goognews_wordvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)
Upvotes: 8
Reputation: 915
Do you know about this open list/set of pretrained models? Maybe an alternative would serve you better than the jumbo Google one. :)
https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models
I don't know how to do precisely what you need, but there is some discussion on the Google group about trimming models that might be of use: https://groups.google.com/forum/#!topic/gensim/wkVhcuyj0Sg
They also reference a recent change for minimising the model, though I know that is not exactly what you want:
https://github.com/RaRe-Technologies/gensim/pull/987
Upvotes: 4