Reputation: 2092
I am using the gensim word2vec library in Python with the pre-trained GoogleNews-vectors-negative300.bin model.
However, some words in my corpus have no word vector in the model, so looking them up raises a KeyError. How do I solve this problem?
Loading the pre-trained model (GoogleNews-vectors-negative300.bin):

    import numpy as np
    from gensim.models import Word2Vec

    model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    print("model loaded...")

    def buildWordVector(text, size):
        # Average the vectors of all words in `text` that the model knows.
        vec = np.zeros(size).reshape((1, size))
        count = 0.
        for word in text:
            try:
                vec += model[word].reshape((1, size))
                count += 1.
                # print("found! %s" % word)
            except KeyError:
                print("not found! %s" % word)  # missing words
                continue
        if count != 0:
            vec /= count
        return vec

    # n_dim and x_train are defined elsewhere in my script
    trained_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])
Is it possible to add new words to a pre-trained Word2vec model, and if so, how?
Upvotes: 2
Views: 4857
Reputation: 4898
EDIT 2019/06/07
As pointed out by @Oleg Melnikov and https://rare-technologies.com/word2vec-tutorial/#online_training__resuming, it is not possible to resume training without the vocab tree, which is not saved once training with the C code is complete. The tutorial states:
"Note that it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there."
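Since querying still works, one way to avoid the KeyError from the question is to test membership before the lookup instead of catching the exception. This is only a minimal sketch: it assumes the loaded vectors support `in` and `[]` like a dict (gensim's vector objects do), and it uses a plain dict as a stand-in for the real model. The names `average_vector` and `toy_vectors` are made up for illustration.

```python
import numpy as np

def average_vector(tokens, vectors, size):
    """Average the vectors of the tokens found in `vectors`; collect the rest."""
    total = np.zeros(size)
    missing = []
    count = 0
    for token in tokens:
        if token in vectors:      # membership test, so no KeyError is raised
            total += vectors[token]
            count += 1
        else:
            missing.append(token)
    if count:
        total /= count
    return total.reshape(1, size), missing

# Hypothetical stand-in for the pre-trained model: a dict of 3-dimensional vectors.
toy_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.0, 1.0, 0.0]),
}
vec, missing = average_vector(["cat", "dog", "axolotl"], toy_vectors, 3)
```

Returning the list of missing words alongside the average makes it easy to see how much of the corpus falls outside the pre-trained vocabulary.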
1. Get pre-trained vectors, e.g. Google News
2. Load the model in gensim
3. Continue training the model in gensim
These commands might come in handy:

    # Loading pre-trained vectors
    model = Word2Vec.load_word2vec_format('/tmp/vectors.bin', binary=True)
    # Training the model with list of sentences (with 4 CPU cores)
    model.train(sentences, workers=4)
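If the model cannot be trained further, a possible workaround is to leave the pre-trained vectors untouched and give each unknown word its own small random vector, cached so the same word always maps to the same vector. This is not what the steps above do, and it is not equivalent to training: the random vectors carry no meaning. The sketch below assumes the pre-trained vectors behave like a dict. The class name `FallbackVectors` and the range of 0.25 are arbitrary choices for illustration.

```python
import numpy as np

class FallbackVectors:
    """Look up pre-trained vectors, falling back to cached random vectors."""

    def __init__(self, pretrained, size, seed=0):
        self.pretrained = pretrained   # dict-like: word -> vector
        self.size = size
        self.extra = {}                # vectors invented for unknown words
        self.rng = np.random.RandomState(seed)

    def __getitem__(self, word):
        if word in self.pretrained:
            return self.pretrained[word]
        if word not in self.extra:
            # Assumption: small uniform noise; the range is arbitrary.
            self.extra[word] = self.rng.uniform(-0.25, 0.25, self.size)
        return self.extra[word]

# Hypothetical stand-in for the pre-trained model.
lookup = FallbackVectors({"cat": np.ones(3)}, size=3)
known = lookup["cat"]
first = lookup["axolotl"]
second = lookup["axolotl"]
```

Whether random vectors are acceptable depends on the downstream task. For averaging word vectors, simply skipping unknown words may work just as well.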
Upvotes: 1