mao95

Reputation: 1122

Manage KeyError with gensim and pretrained word2vec model

I pretrained a word embedding using wang2vec (https://github.com/wlin12/wang2vec) and loaded it in Python through gensim. When I try to get the vector of a word that is not in the vocabulary, I get, as expected:

KeyError: "word 'kjklk' not in vocabulary"

So I thought about adding an item to the vocabulary to map OOV (out-of-vocabulary) words, say <OOV>. Since the vocabulary is a dict, I assumed I could simply add the item {"<OOV>": 0}.

But when I inspected an item of the vocabulary with

model = gensim.models.KeyedVectors.load_word2vec_format(w2v_ext, binary=False, unicode_errors='ignore')
dict(list(model.vocab.items())[5:6])

The output was something like

{'word': <gensim.models.keyedvectors.Vocab at 0x7fc5aa6007b8>}

So, is there a way to add the <OOV> token to the vocabulary of a pretrained word embedding loaded through gensim, and avoid the KeyError? I looked at the gensim docs and found this: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.build_vocab but it does not seem to work with the update parameter.

Upvotes: 0

Views: 1475

Answers (1)

gojomo

Reputation: 54173

Adding a synthetic '<OOV>' token would just let you look up that token, like model['<OOV>']. The model would still raise a KeyError for absent keys like 'kjklk'.

There's no built-in support for adding any such 'catch-all' mapping. Often, ignoring unknown tokens is better than using some plug value (such as a zero-vector or random-vector).
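One way to ignore unknown tokens can be sketched as follows. This is only an illustration: toy_model is a plain dict standing in for the loaded model, and vectors_for_known_tokens is a hypothetical helper name, not part of gensim. It assumes only that the model supports the in check and [] lookup.

```python
# Plain dict standing in for the loaded model (assumption for illustration).
toy_model = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}

def vectors_for_known_tokens(model, tokens):
    """Return vectors only for tokens the model knows; skip the rest."""
    return [model[token] for token in tokens if token in model]

tokens = ["cat", "kjklk", "dog"]
vectors = vectors_for_known_tokens(toy_model, tokens)
# 'kjklk' is silently dropped, so no KeyError is raised.
```

Downstream code then works only with vectors that carry real information, rather than with a placeholder that could skew averages or similarity scores.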

It's fairly idiomatic in Python to explicitly check if a key is present, via the in keyword, if you want to do something different for absent keys. For example:

vector = model['kjklk'] if 'kjklk' in model else DEFAULT_VECTOR

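(Notably, the conditional expression *expr1* if *expr2* else *expr3* evaluates *expr2* first and only evaluates *expr1* when *expr2* is true, so the lookup is never attempted for an absent key and no KeyError is raised.)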
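If the same check is needed in many places, it could be wrapped in a small helper. The sketch below is hypothetical: get_vector and toy_model are illustrative names, a plain dict stands in for the loaded model, and DEFAULT_VECTOR is a placeholder whose length would have to match the real embedding dimension.

```python
# Plain dict standing in for the loaded model (assumption for illustration).
toy_model = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}

# Hypothetical fallback; its size must match the real embedding dimension.
DEFAULT_VECTOR = [0.0, 0.0]

def get_vector(model, word, default):
    """Return model[word] if the word is present, otherwise the default."""
    # The membership test runs first, so model[word] is never evaluated
    # for an absent key.
    return model[word] if word in model else default

known = get_vector(toy_model, "cat", DEFAULT_VECTOR)
unknown = get_vector(toy_model, "kjklk", DEFAULT_VECTOR)
```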

Python also has defaultdict, a dict subclass that returns a default value for any unknown key. See:

https://docs.python.org/3.7/library/collections.html#collections.defaultdict

It'd be possible to try replacing the KeyedVectors vocab dictionary with one of those, if the behavior is really important, but there could be side effects on other code.
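One such side effect can be seen with defaultdict on its own, without gensim: looking up a missing key inserts it into the dictionary. The sketch below uses a toy mapping and a hypothetical DEFAULT_VECTOR; it does not touch KeyedVectors.

```python
from collections import defaultdict

# Hypothetical fallback vector, for illustration only.
DEFAULT_VECTOR = (0.0, 0.0)

# A defaultdict seeded with one known entry.
vectors = defaultdict(lambda: DEFAULT_VECTOR, {"cat": (0.1, 0.2)})

size_before = len(vectors)
missing = vectors["kjklk"]  # no KeyError: the default factory is called
size_after = len(vectors)   # the missing key has now been stored

# After the lookup, the unknown word looks like a real vocabulary entry.
present_after_lookup = "kjklk" in vectors
```

Every mistyped or unseen word that is looked up this way becomes a stored key, so membership checks and vocabulary size stop reflecting what the model was actually trained on.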

Upvotes: 1
