Reputation: 41
The Gensim Word2Vec model I've trained lacks vectors for some words. That is, although the word "yuval" appears in my input, the model has no vector for it. What is the cause?
Upvotes: 0
Views: 210
Reputation: 3154
To expand on @gojomo's answer: during training, Word2Vec discards tokens that occur fewer than min_count times, because such rare words have too few usage examples to learn anything useful from their contexts. This means those tokens won't have vectors.
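The trimming logic can be sketched with the standard library alone (the toy corpus and threshold here are made up for illustration; the real trimming happens inside gensim's vocabulary-building step):

```python
from collections import Counter

# Toy corpus: each sentence is a list of string tokens,
# the same format gensim's Word2Vec expects as input.
corpus = [
    ["the", "car", "drove", "fast"],
    ["yuval", "sold", "the", "car"],
    ["the", "fast", "car", "won"],
]

min_count = 2  # tokens seen fewer than this many times are discarded

counts = Counter(token for sentence in corpus for token in sentence)
vocab = {token for token, n in counts.items() if n >= min_count}

print(sorted(vocab))     # ['car', 'fast', 'the'] -- frequent tokens survive
print("yuval" in vocab)  # False: 'yuval' appears only once, so it is dropped
```

Any token absent from this surviving vocabulary simply never gets a row in the model's vector matrix.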
To check this, you can load the model and verify whether the vocabulary contains the token you are interested in:
>>> import gensim
>>> model = gensim.models.KeyedVectors.load(...)
>>> 'car' in model
True
>>> 'yuval' in model
False
Since 'yuval' is not in the vocabulary, the in operator returns False, and indexing the model with it raises a KeyError:
>>> model['car']
...
...
<numpy array>
>>> model['yuval']
...
...
KeyError: "word 'yuval' not in vocabulary"
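To avoid the KeyError in application code, check membership before indexing. A minimal sketch, using a plain dict in place of the KeyedVectors object (both support in and [] lookup; the vector values are made up):

```python
# A plain dict mimics the model's lookup interface for illustration only.
model = {"car": [0.1, 0.2, 0.3]}  # pretend these are learned word vectors

def get_vector(model, token):
    """Return the token's vector, or None if it was trimmed from the vocab."""
    if token in model:
        return model[token]
    return None

print(get_vector(model, "car"))    # [0.1, 0.2, 0.3]
print(get_vector(model, "yuval"))  # None: no vector was learned
```

Returning None (or a zero vector, depending on your downstream code) is usually preferable to letting the KeyError propagate out of a batch pipeline.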
If you really expect that the word should be in the list of vocabulary words, you can always print them out too:
>>> for token in model.vocab.keys():  # model.key_to_index in Gensim 4.0+
...     print(token)
...
Upvotes: 1
Reputation: 54243
You either didn't supply 'yuval' as a token within a properly-formatted corpus, or the number of occurrences was below the model's min_count. (It's generally helpful for a Word2Vec model to discard low-frequency words – more data isn't automatically better if there are only a few examples of a word.)
Double-check that 'yuval' appears in the corpus, how many times it appears, and whether that's sufficient for the word to survive min_count trimming.
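A quick standard-library check of both points – correct corpus formatting and occurrence count. The corpus and threshold below are placeholders; substitute your own training data and the min_count value you actually used:

```python
from collections import Counter

# Placeholder corpus: replace with your real training data.
corpus = [
    ["yuval", "likes", "word2vec"],
    ["word2vec", "needs", "many", "examples"],
]
min_count = 5  # must match the value used when training the model

# 1. Proper formatting: an iterable of sentences, each a list of str tokens.
assert all(isinstance(sentence, list) for sentence in corpus)
assert all(isinstance(tok, str) for sentence in corpus for tok in sentence)

# 2. Occurrence count: does 'yuval' appear often enough to survive trimming?
freq = Counter(tok for sentence in corpus for tok in sentence)
verdict = "survives" if freq["yuval"] >= min_count else "trimmed"
print(freq["yuval"], "occurrence(s) ->", verdict)  # 1 occurrence(s) -> trimmed
```

If the word is trimmed and you genuinely need a vector for it, either lower min_count when training or gather more text containing the word.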
Upvotes: 1