Reputation: 22734
I have two pretrained word embeddings: Glove.840b.300.txt and custom_glove.300.txt.
One is pretrained by Stanford and the other is trained by me, and the two have different vocabularies. To reduce out-of-vocabulary (OOV) words, I'd like to add the words that appear in custom_glove.300.txt but not in Glove.840b.300.txt to the latter. How do I do that easily?
This is how I load and save the files in gensim 3.4.0.
from gensim.models.keyedvectors import KeyedVectors

# load a word2vec-format file, then write it back out as plain text
model = KeyedVectors.load_word2vec_format('path/to/thefile')
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
Upvotes: 2
Views: 2574
Reputation: 54210
I don't know an easy way.
In particular, word-vectors that weren't co-trained together won't have compatible, comparable coordinate spaces. (There's no single right place for a word, just a relatively good place compared to the other words in the same model.)
So you can't just append the missing words from another model: you'd need to transform them into compatible locations first. Fortunately, it seems to work to use a set of shared anchor words, present in both word-vector sets, to learn a transformation, then apply that transformation to the words you want to move over.
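For illustration, here's a minimal sketch of that idea in plain NumPy (not any particular gensim helper): learn an orthogonal map from a set of anchor words shared by both vector sets (orthogonal Procrustes via SVD), then apply it to any source vector. Names like src_kv, tgt_kv and anchors are illustrative, and it assumes both files are loaded as gensim 3.x KeyedVectors, as in the question.

import numpy as np

def learn_mapping(src_kv, tgt_kv, anchors):
    # Stack the anchor words' vectors from each model into aligned matrices.
    X = np.vstack([src_kv[w] for w in anchors])  # (n_anchors, dim) source
    Y = np.vstack([tgt_kv[w] for w in anchors])  # (n_anchors, dim) target
    # Orthogonal Procrustes: the orthogonal W minimizing ||X @ W - Y|| is
    # U @ Vt from the SVD of X.T @ Y; being orthogonal, it preserves the
    # source space's internal geometry while rotating it onto the target's.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# anchors = [w for w in src_kv.vocab if w in tgt_kv.vocab]  # shared words (gensim 3.x .vocab)
# W = learn_mapping(src_kv, tgt_kv, anchors)
# src_kv['someword'] @ W now lives in roughly the same space as tgt_kv's vectors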
There's a class, TranslationMatrix, and a demo notebook in gensim showing this process for language translation (an application mentioned in the original word2vec papers). You could conceivably use this, combined with the ability to append extra vectors to a gensim KeyedVectors instance, to create a new set of vectors containing a superset of the words in either of your source models, as in the sketch below.
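To make that concrete, here's a hedged sketch of the merge step, using the plain NumPy mapping W from the snippet above rather than the TranslationMatrix API: write the union of the two vocabularies out in word2vec text format, which load_word2vec_format reads back directly. stanford_kv, custom_kv and W continue from the previous sketch, and the output path is illustrative.

from gensim.models.keyedvectors import KeyedVectors

# words that exist only in the custom model, to be mapped into the Stanford space
missing = [w for w in custom_kv.vocab if w not in stanford_kv.vocab]
dim = stanford_kv.vector_size

with open('merged.300d.txt', 'w', encoding='utf-8') as out:
    # word2vec text format begins with a "<vocab_size> <dimensions>" header
    out.write('%d %d\n' % (len(stanford_kv.vocab) + len(missing), dim))
    for w in stanford_kv.vocab:   # Stanford vectors, written through unchanged
        out.write(w + ' ' + ' '.join('%.6f' % x for x in stanford_kv[w]) + '\n')
    for w in missing:             # custom vectors, transformed into the shared space
        out.write(w + ' ' + ' '.join('%.6f' % x for x in custom_kv[w] @ W) + '\n')

merged = KeyedVectors.load_word2vec_format('merged.300d.txt', binary=False)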
Upvotes: 4