I have trained a word2vec model called word_vectors, using the Gensim package with size = 512.
from gensim.models import KeyedVectors
from gensim.test.utils import get_tmpfile
fname = get_tmpfile('word2vec.model')
word_vectors = KeyedVectors.load(fname, mmap='r')
Next, I created a new NumPy array (also of size 512) and added it to the word2vec model as follows:
from numpy.random import rand
vector = (rand(512) - 0.5) * 20
word_vectors.add('koffie', vector)
This seems to work, and when I call
word_vectors['koffie']
I get the array as output, as expected.
However, when I want to look for the most similar words in my model and run the following code:
word_vectors.most_similar('koffie')
I get the following error:
Traceback (most recent call last):
  File "<ipython-input-283-ce992786ce89>", line 1, in <module>
    word_vectors.most_similar('koffie')
  File "C:\Users\20200016\AppData\Local\Continuum\anaconda3\envs\ldaword2vec\lib\site-packages\gensim\models\keyedvectors.py", line 553, in most_similar
    mean.append(weight * self.word_vec(word, use_norm=True))
  File "C:\Users\20200016\AppData\Local\Continuum\anaconda3\envs\ldaword2vec\lib\site-packages\gensim\models\keyedvectors.py", line 461, in word_vec
    result = self.vectors_norm[self.vocab[word].index]
IndexError: index 146139 is out of bounds for axis 0 with size 146138
word_vector.size()
Traceback (most recent call last):
  File "<ipython-input-284-2606aca38446>", line 1, in <module>
    word_vector.size()
NameError: name 'word_vector' is not defined
The error seems to indicate that my indexing isn't correct here. But since I am only indexing indirectly (with a key rather than an actual numeric index), I don't see what I need to change here.
Does anyone know what is going wrong, and what I can do to fix this error?
Reputation: 54153
The first time you call .most_similar(), a KeyedVectors instance (in gensim versions through 3.8.3) will create a cache of unit-normalized vectors to assist in all subsequent bulk-similarity operations, and place it in .vectors_norm.
It looks like your addition of a new vector didn't flush, recalculate, or expand that cached .vectors_norm. Originally, the KeyedVectors class and the .most_similar() operation were not designed with constantly growing or constantly changing sets of vectors in mind, but rather as utilities for a frozen, post-training set of vectors.
So that's the cause of your IndexError.
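To make the mechanism concrete, here is a minimal toy sketch using only the standard library. ToyKeyedVectors and its methods are hypothetical stand-ins, not gensim code; the sketch assumes the cache had already been built by an earlier similarity call before the new vector was added.

```python
import math


class ToyKeyedVectors:
    """Hypothetical stand-in that mimics a lazily built norm cache."""

    def __init__(self, words, vectors):
        self.index = {word: i for i, word in enumerate(words)}
        self.vectors = [list(v) for v in vectors]
        self.vectors_norm = None  # filled on the first normalized lookup

    def add(self, word, vector):
        # Grows the raw vectors but leaves an already built cache untouched.
        self.index[word] = len(self.vectors)
        self.vectors.append(list(vector))

    def init_sims(self):
        # Build the unit-normalized cache only if it does not exist yet.
        if self.vectors_norm is None:
            self.vectors_norm = [
                [x / math.sqrt(sum(y * y for y in v)) for x in v]
                for v in self.vectors
            ]

    def normed(self, word):
        self.init_sims()
        return self.vectors_norm[self.index[word]]


kv = ToyKeyedVectors(["thee", "melk"], [[3.0, 4.0], [1.0, 0.0]])
kv.normed("thee")             # first lookup builds the cache (2 rows)
kv.add("koffie", [0.0, 2.0])  # raw vectors now have 3 rows, cache still 2

try:
    kv.normed("koffie")
    stale_lookup_failed = False
except IndexError:
    # The new word's index points past the end of the stale cache.
    stale_lookup_failed = True

kv.vectors_norm = None        # clear the cache so it is rebuilt on demand
koffie_unit = kv.normed("koffie")
```

The lookup fails while the cache is stale and succeeds once the cache has been cleared and rebuilt, which is the same pattern as the IndexError in the question.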
You should be able to work around this by explicitly clearing .vectors_norm any time you perform modifications or additions to the KeyedVectors, e.g.:
word_vectors.vectors_norm = None
(This shouldn't be necessary in the next 4.0.0 release of gensim, but I'll double-check there's not a similar problem there.)
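If you add vectors in several places, one option is to wrap the two steps in a small helper so the cache reset is never forgotten. This is only a sketch: add_and_reset_cache is a hypothetical name, and it assumes an object with the gensim 3.x interface used in the question, i.e. an add(word, vector) method and a vectors_norm attribute.

```python
def add_and_reset_cache(kv, word, vector):
    """Add one vector, then drop the cached unit-normalized vectors.

    Assumes `kv` exposes `add(word, vector)` and a `vectors_norm`
    attribute, as KeyedVectors does in gensim versions through 3.8.3.
    """
    kv.add(word, vector)
    # Setting the cache to None forces the next similarity call to rebuild it.
    kv.vectors_norm = None
```

With the names from the question, the call would be add_and_reset_cache(word_vectors, 'koffie', vector), followed by word_vectors.most_similar('koffie').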
Separately:
Your 'word_vector' is not defined error is simply because you seem to have left the 's' off your chosen variable name word_vectors.
You probably don't need to be using the gensim testing utility method get_tmpfile(); just use your own explicit, intentional filesystem paths for saving and loading.
Whether it's proper to use KeyedVectors.load() depends on what was saved. If you are in fact saving a full Word2Vec class instance (more than just the vectors), using Word2Vec.load() would be more appropriate.