Reputation: 22734
I have two pretrained word embeddings: Glove.840b.300.txt and custom_glove.300.txt.
One is pretrained by Stanford and the other is trained by me, and the two have different vocabularies. To reduce out-of-vocabulary (OOV) words, I'd like to add the words that appear in custom_glove.300.txt but not in Glove.840b.300.txt to the latter. How do I do that easily?
This is how I load and save the files in gensim 3.4.0.
from gensim.models.keyedvectors import KeyedVectors

# load a word2vec-format file, then write it back out as plain text
model = KeyedVectors.load_word2vec_format('path/to/thefile')
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
Upvotes: 2
Views: 2574
Reputation: 54210
I don't know an easy way.
In particular, word-vectors that weren't co-trained together won't have compatible, comparable coordinate spaces. (There's no single right place for a word, just a relatively good place compared to the other words in the same model.)
So you can't just append the missing words from another model: you'd need to transform them into compatible locations first. Fortunately, it seems to work to use a set of shared anchor words, present in both word-vector sets, to learn a transformation, then apply that transformation to the words you want to move over.
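For illustration, here's a minimal sketch of that idea in plain NumPy (not any particular gensim helper): learn an orthogonal map from a set of anchor words shared by both vector sets (orthogonal Procrustes via SVD), then apply it to any source vector. Names like src_kv, tgt_kv and anchors are illustrative, and it assumes both files are loaded as gensim 3.x KeyedVectors, as in the question.

import numpy as np

def learn_mapping(src_kv, tgt_kv, anchors):
    # Stack the anchor words' vectors from each model into aligned matrices.
    X = np.vstack([src_kv[w] for w in anchors])  # (n_anchors, dim) source
    Y = np.vstack([tgt_kv[w] for w in anchors])  # (n_anchors, dim) target
    # Orthogonal Procrustes: the orthogonal W minimizing ||X @ W - Y|| is
    # U @ Vt from the SVD of X.T @ Y; being orthogonal, it preserves the
    # source space's internal geometry while rotating it onto the target's.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# anchors = [w for w in src_kv.vocab if w in tgt_kv.vocab]  # shared words (gensim 3.x .vocab)
# W = learn_mapping(src_kv, tgt_kv, anchors)
# src_kv['someword'] @ W now lives in roughly the same space as tgt_kv's vectors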
There's a class, TranslationMatrix, and a demo notebook in gensim showing this process for language translation (an application mentioned in the original word2vec papers). You could conceivably use this, combined with the ability to append extra vectors to a gensim KeyedVectors instance, to create a new set of vectors containing a superset of the words in either of your source models, as in the sketch below.
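To make that concrete, here's a hedged sketch of the merge step, using the plain NumPy mapping W from the snippet above rather than the TranslationMatrix API: write the union of the two vocabularies out in word2vec text format, which load_word2vec_format reads back directly. stanford_kv, custom_kv and W continue from the previous sketch, and the output path is illustrative.

from gensim.models.keyedvectors import KeyedVectors

# words that exist only in the custom model, to be mapped into the Stanford space
missing = [w for w in custom_kv.vocab if w not in stanford_kv.vocab]
dim = stanford_kv.vector_size

with open('merged.300d.txt', 'w', encoding='utf-8') as out:
    # word2vec text format begins with a "<vocab_size> <dimensions>" header
    out.write('%d %d\n' % (len(stanford_kv.vocab) + len(missing), dim))
    for w in stanford_kv.vocab:   # Stanford vectors, written through unchanged
        out.write(w + ' ' + ' '.join('%.6f' % x for x in stanford_kv[w]) + '\n')
    for w in missing:             # custom vectors, transformed into the shared space
        out.write(w + ' ' + ' '.join('%.6f' % x for x in custom_kv[w] @ W) + '\n')

merged = KeyedVectors.load_word2vec_format('merged.300d.txt', binary=False)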
Upvotes: 4