Reputation: 41
The Gensim Word2Vec model I've trained lacks vectors for some words. That is, although the word "yuval" appears in my input, the model has no vector for it. What is the cause?
Upvotes: 0
Views: 210
Reputation: 3154
To expand on @gojomo's answer: during training, Word2Vec discards tokens that occur fewer than min_count times, because such rare words have too few usage examples to learn anything useful from their contexts. This means those tokens won't have vectors.
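The trimming logic can be sketched with the standard library alone (the toy corpus and threshold here are made up for illustration; the real trimming happens inside gensim's vocabulary-building step):

```python
from collections import Counter

# Toy corpus: each sentence is a list of string tokens,
# the same format gensim's Word2Vec expects as input.
corpus = [
    ["the", "car", "drove", "fast"],
    ["yuval", "sold", "the", "car"],
    ["the", "fast", "car", "won"],
]

min_count = 2  # tokens seen fewer than this many times are discarded

counts = Counter(token for sentence in corpus for token in sentence)
vocab = {token for token, n in counts.items() if n >= min_count}

print(sorted(vocab))     # ['car', 'fast', 'the'] -- frequent tokens survive
print("yuval" in vocab)  # False: 'yuval' appears only once, so it is dropped
```

Any token absent from this surviving vocabulary simply never gets a row in the model's vector matrix.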
To check this, you can load the model and verify whether the vocabulary contains the token you are interested in:
>>> import gensim
>>> model = gensim.models.KeyedVectors.load(...)
>>> 'car' in model
True
>>> 'yuval' in model
False
Since 'yuval' is not in the vocabulary, the in operator returns False, and indexing the model with it raises a KeyError:
>>> model['car']
...
...
<numpy array>
>>> model['yuval']
...
...
KeyError: "word 'yuval' not in vocabulary"
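To avoid the KeyError in application code, check membership before indexing. A minimal sketch, using a plain dict in place of the KeyedVectors object (both support in and [] lookup; the vector values are made up):

```python
# A plain dict mimics the model's lookup interface for illustration only.
model = {"car": [0.1, 0.2, 0.3]}  # pretend these are learned word vectors

def get_vector(model, token):
    """Return the token's vector, or None if it was trimmed from the vocab."""
    if token in model:
        return model[token]
    return None

print(get_vector(model, "car"))    # [0.1, 0.2, 0.3]
print(get_vector(model, "yuval"))  # None: no vector was learned
```

Returning None (or a zero vector, depending on your downstream code) is usually preferable to letting the KeyError propagate out of a batch pipeline.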
If you really expect that the word should be in the list of vocabulary words, you can always print them out too:
>>> for token in model.vocab.keys():  # model.key_to_index in Gensim 4.0+
...     print(token)
...
Upvotes: 1
Reputation: 54243
You either didn't supply 'yuval' as a token within a properly-formatted corpus, or the number of occurrences was below the model's min_count. (It's generally helpful for a Word2Vec model to discard low-frequency words – more data isn't automatically better if there are only a few examples of a word.)
Double-check that 'yuval' appears in the corpus, how many times it appears, and whether that's sufficient for the word to survive min_count trimming.
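A quick standard-library check of both points – correct corpus formatting and occurrence count. The corpus and threshold below are placeholders; substitute your own training data and the min_count value you actually used:

```python
from collections import Counter

# Placeholder corpus: replace with your real training data.
corpus = [
    ["yuval", "likes", "word2vec"],
    ["word2vec", "needs", "many", "examples"],
]
min_count = 5  # must match the value used when training the model

# 1. Proper formatting: an iterable of sentences, each a list of str tokens.
assert all(isinstance(sentence, list) for sentence in corpus)
assert all(isinstance(tok, str) for sentence in corpus for tok in sentence)

# 2. Occurrence count: does 'yuval' appear often enough to survive trimming?
freq = Counter(tok for sentence in corpus for tok in sentence)
verdict = "survives" if freq["yuval"] >= min_count else "trimmed"
print(freq["yuval"], "occurrence(s) ->", verdict)  # 1 occurrence(s) -> trimmed
```

If the word is trimmed and you genuinely need a vector for it, either lower min_count when training or gather more text containing the word.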
Upvotes: 1