Andy Huang

Reputation: 377

Adding words to a gensim word2vec model, but they are not shown in model.wv

I have a pre-trained model, but I need to add some new words to it.

I tried:

model.build_vocab([[new_word1, new_word2]], update=True)
model.train([[new_word1, new_word2]], total_examples=model.corpus_count, epochs=model.epochs)

But when I check:

model.wv[new_word1]
model.wv[new_word2]

I got

KeyError: "Key {new_word1} not present"

The same happens for new_word2.

I have already looked at the question "How to add words and vectors manually to Word2vec gensim?"

How can I solve it? Thanks

Upvotes: 0

Views: 368

Answers (1)

gojomo

Reputation: 54153

If you enable logging at the INFO level, you may see more hints about where things are not having the expected effect.
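For reference, a minimal sketch of turning that logging on, using only Python's standard logging module. The format string is just one common choice, and the "gensim" logger name assumes the library's modules log under their package name:

```python
import logging

# Attach a basic handler to the root logger (a no-op if one is already set up).
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s")

# Let INFO-level messages from gensim's module loggers through.
# Assumption: those loggers are named after their modules, e.g.
# "gensim.models.word2vec", so they inherit this level.
logging.getLogger("gensim").setLevel(logging.INFO)
```

Run this before calling .build_vocab() and .train(), then read the messages about how many words were kept or dropped.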

In particular, the default min_count value used by Word2Vec is 5, meaning any words that appear fewer than 5 times in a corpus fed to .build_vocab() will be ignored. (Ignoring such rare words is almost always the right thing to do with the word2vec algorithm, which can only learn useful word-vectors when there are many varied examples of a word's usage.)

If your test is truly just 2 new words, each with just one use, a model with reasonable defaults will ignore those two single-occurrence words.
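To make the frequency cutoff concrete, here is a rough pure-Python illustration of the rule. It is not gensim's actual code; surviving_words is a hypothetical helper, and it only mimics the idea that words seen fewer than min_count times in the supplied corpus are dropped:

```python
from collections import Counter


def surviving_words(sentences, min_count=5):
    """Return the words that occur at least min_count times in sentences.

    Hypothetical helper that imitates the min_count cutoff; it is not
    part of gensim.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


# The update corpus from the question: one sentence, each word used once.
tiny_update = [["new_word1", "new_word2"]]

kept_default = surviving_words(tiny_update)                  # cutoff of 5
kept_permissive = surviving_words(tiny_update, min_count=1)  # keep everything
```

With the cutoff of 5, kept_default is empty, which matches the KeyError seen in the question. Lowering the cutoff to 1 would keep both words, but vectors learned from a single usage example are unlikely to be meaningful.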

Separately: expanding the vocabulary of an existing model is a tricky, error-prone process. Most improvised/naive ways of doing it are unlikely to reliably give good results. In most cases the safer, more robust process would be re-training with all text, old and new, rather than tiny new increments.

Upvotes: 2
