Nomiluks

Reputation: 2092

How to add missing word vectors to the GoogleNews-vectors-negative300.bin pre-trained model?

I am using the gensim word2vec library in Python with the pre-trained GoogleNews-vectors-negative300.bin model. However, some words in my corpus have no vector in the model, so I get a KeyError for them. How do I solve this problem?

Here is what I have tried so far:

1: Loading the GoogleNews-vectors-negative300.bin pre-trained model:

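from gensim.models import Word2Vec  # in gensim >= 1.0 use KeyedVectors.load_word2vec_format instead

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print("model loaded...")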

2: Building a word vector for each tweet in the training set by averaging the vectors of all words in the tweet, then scaling:

import numpy as np

def buildWordVector(text, size):
    # Average the vectors of all words in `text` that the model knows.
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in text:
        try:
            vec += model[word].reshape((1, size))
            count += 1.
        except KeyError:
            print("not found! " + word)  # missing word: skip it
            continue
    if count != 0:
        vec /= count
    return vec

trained_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])

Please tell me how to add new words to a pre-trained Word2vec model.

Upvotes: 2

Views: 4857

Answers (1)

kampta

Reputation: 4898

EDIT 2019/06/07

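As pointed out by @Oleg Melnikov and https://rare-technologies.com/word2vec-tutorial/#online_training__resuming, it is not possible to resume training without the vocab tree, which is not saved once training with the C code is complete. The steps further down therefore do not work for vectors loaded with load_word2vec_format(), such as the Google News ones. The tutorial says: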

Note that it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.
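Since the loaded vectors can still be queried, it may help to first measure which words are actually missing before deciding how to handle them. A minimal sketch, assuming a toy `vectors` dict as a hypothetical stand-in for the loaded model (with the real model the membership test would be `word in model`); the `missing_words` helper and the sample tweets are made up for illustration:

```python
from collections import Counter

# Hypothetical stand-in for the loaded model: maps word -> vector.
# With the real gensim model, use `word in model` for the membership test.
vectors = {
    "good": [0.1, 0.3],
    "movie": [0.2, 0.4],
}

def missing_words(tokenized_texts, vocab):
    """Count how often each out-of-vocabulary word occurs in the corpus."""
    counts = Counter()
    for tokens in tokenized_texts:
        for word in tokens:
            if word not in vocab:
                counts[word] += 1
    return counts

# Made-up tokenized tweets, only for the demo.
tweets = [["good", "movie"], ["gr8", "movie"], ["gr8", "lol"]]
oov = missing_words(tweets, vectors)
print(oov.most_common())  # most frequent missing words first
```

If only a handful of rare words are missing, skipping them (as the question's code already does) is often good enough.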


  1. Get pre-trained vectors, e.g. Google News

  2. Load the model in gensim

  3. Continue training the model in gensim

These commands might come in handy:

# Loading pre-trained vectors
model = Word2Vec.load_word2vec_format('/tmp/vectors.bin', binary=True)

# Training the model with list of sentences (with 4 CPU cores)
model.train(sentences, workers=4)
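
Given that training cannot be resumed on the Google News vectors, a common workaround (not something gensim does for you) is to give each missing word a fixed random vector at lookup time instead of raising KeyError. A rough sketch under stated assumptions: `vectors` is a toy dict standing in for the loaded model, `DIM` is 3 instead of 300 to keep it readable, and the names `lookup`, `fallback` and the ±0.25 range are arbitrary choices for illustration:

```python
import random

DIM = 3  # the Google News vectors have 300 dimensions; 3 keeps the demo small

# Hypothetical stand-in for the loaded model: maps word -> vector.
vectors = {"good": [0.1, 0.2, 0.3], "movie": [0.3, 0.2, 0.1]}

# Cache so that every unknown word keeps the same vector once assigned.
fallback = {}

def lookup(word):
    """Return the known vector, or a cached random one for unknown words."""
    if word in vectors:
        return vectors[word]
    if word not in fallback:
        rng = random.Random(word)  # seeded by the word, so results are repeatable
        fallback[word] = [rng.uniform(-0.25, 0.25) for _ in range(DIM)]
    return fallback[word]

def average_vector(tokens):
    """Average the vectors of all tokens; zero vector for empty input."""
    if not tokens:
        return [0.0] * DIM
    total = [0.0] * DIM
    for word in tokens:
        for i, x in enumerate(lookup(word)):
            total[i] += x
    return [t / len(tokens) for t in total]

print(average_vector(["good", "gr8", "movie"]))
```

Keep in mind that these random vectors carry no semantic information; they only keep the pipeline from failing on unknown words. If the missing words matter, training your own model in gensim on your corpus is the alternative.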

Upvotes: 1
