Reputation: 86
I'm looking for a way to dynamically add pre-trained word vectors to a gensim word2vec model.
I have a pre-trained word2vec model in a text file (words and their embeddings) and I need to get the Word Mover's Distance (for example via gensim.models.Word2Vec.wmdistance) between the documents in a specific corpus and a new document.
To avoid loading the whole vocabulary, I would like to load only the subset of the pre-trained model's words that appear in the corpus. However, if the new document contains words that are not in the corpus but are in the original model's vocabulary, I want to add them to the model so that they are considered in the computation.
What I want is to save RAM, so possible things that would help me:
Thanks in advance.
Upvotes: 6
Views: 2538
Reputation: 4189
You can just use the keyed vectors from gensim.models.keyedvectors. They are very easy to use.
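# gensim 3.x names; in gensim 4 the class is KeyedVectors and the method is add_vectors
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors
w2v = WordEmbeddingsKeyedVectors(50)  # 50 = vector length
# new_words: list of str; their_new_vecs: one 50-dimensional vector per word
w2v.add(new_words, their_new_vecs)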
And if you have already built a model using gensim.models.Word2Vec, you can just do this. Suppose I want to add the token <UNK> with a random vector:
model.wv["<UNK>"] = np.random.rand(100) # 100 is the vectors length
The complete example would be like this:
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load("text8") # load dataset as iterable
model = Word2Vec(dataset)
model.wv["<UNK>"] = np.random.rand(100)  # length must match the model's vector size (default 100)
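For the RAM concern in the question, one option is to stream the text file and keep only the words you actually need. Below is a minimal sketch, assuming the file is in the plain word2vec text format (an optional "count dim" header line, then one "word v1 v2 ..." entry per line). `load_subset` is a hypothetical helper, not part of gensim, and the sample values are made up for illustration.

```python
import io

def load_subset(lines, wanted):
    """Stream word2vec-style text lines, keeping only vectors for words in `wanted`.

    `lines` can be any iterable of strings (e.g. an open file object), so the
    full vocabulary is never held in memory at once.
    """
    wanted = set(wanted)
    found = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Assumption: a two-field first line is the "<vocab_size> <dim>" header.
        if i == 0 and len(parts) == 2:
            continue
        word, values = parts[0], parts[1:]
        if word in wanted:
            found[word] = [float(v) for v in values]
            if len(found) == len(wanted):
                break  # every requested word has been found, stop reading
    return found

# Tiny in-memory stand-in for the pre-trained text file.
sample = io.StringIO("3 2\ncat 0.1 0.2\ndog 0.3 0.4\nfish 0.5 0.6\n")
subset = load_subset(sample, {"cat", "fish", "unicorn"})
print(sorted(subset))
```

When a new document arrives, the same helper could be called again with only the words missing from the model, and the result (converted to NumPy arrays) passed to the add call shown above. Words that are absent from the file, like "unicorn" here, are simply skipped.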
Upvotes: 2