Emil

Reputation: 1722

IndexError: index is out of bounds - word2vec

I have trained a word2vec model called word_vectors, using the Gensim package with size = 512.

from gensim.models import KeyedVectors
from gensim.test.utils import get_tmpfile
fname = get_tmpfile('word2vec.model')
word_vectors = KeyedVectors.load(fname, mmap='r')

Next, I created a new NumPy array (also of size 512) and added it to the word2vec model as follows:

from numpy.random import rand
vector = (rand(512) - 0.5) * 20
word_vectors.add('koffie', vector)

Doing this seems to go fine and even when I call

word_vectors['koffie']

I get the array as output, as expected.

However, when I want to look for the most similar words in my model and run the following code:

word_vectors.most_similar('koffie')

I get the following error:

Traceback (most recent call last):

  File "<ipython-input-283-ce992786ce89>", line 1, in <module>
    word_vectors.most_similar('koffie')

  File "C:\Users\20200016\AppData\Local\Continuum\anaconda3\envs\ldaword2vec\lib\site-packages\gensim\models\keyedvectors.py", line 553, in most_similar
    mean.append(weight * self.word_vec(word, use_norm=True))

  File "C:\Users\20200016\AppData\Local\Continuum\anaconda3\envs\ldaword2vec\lib\site-packages\gensim\models\keyedvectors.py", line 461, in word_vec
    result = self.vectors_norm[self.vocab[word].index]

IndexError: index 146139 is out of bounds for axis 0 with size 146138


word_vector.size()
Traceback (most recent call last):

  File "<ipython-input-284-2606aca38446>", line 1, in <module>
    word_vector.size()

NameError: name 'word_vector' is not defined

The error seems to indicate that my indexing is incorrect. But since I am only indexing indirectly (with a key rather than an actual numeric index), I don't see what I need to change.

Does anyone know what is going wrong here, and what I can do to overcome this error?

Upvotes: 0

Views: 1101

Answers (1)

gojomo

Reputation: 54153

The first time you call .most_similar(), a KeyedVectors instance (in gensim versions through 3.8.3) builds a cache of unit-normalized vectors, stored in .vectors_norm, to speed up all subsequent bulk-similarity operations.

It looks like adding a new vector didn't flush, recalculate, or expand that cached .vectors_norm. The KeyedVectors class and its .most_similar() operation were originally designed as utilities for a frozen, post-training set of vectors, not for sets of vectors that keep growing or changing.

So that's the cause of your IndexError.

You should be able to work around this by explicitly clearing .vectors_norm any time you modify or add to the KeyedVectors, e.g.:

word_vectors.vectors_norm = None
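
To make the mechanism concrete, here is a minimal toy sketch of a lazily built norm cache going stale. It is not gensim's actual code: the class ToyKeyedVectors and its internals are hypothetical stand-ins, written in plain Python, that only mimic the behaviour described above (a cache built on the first .most_similar() call and not refreshed by .add()).

```python
import math


class ToyKeyedVectors:
    """Hypothetical toy stand-in, NOT gensim's implementation."""

    def __init__(self):
        self.vocab = {}            # word -> row index
        self.vectors = []          # raw vectors, one row per word
        self.vectors_norm = None   # cache of unit-length rows, built on demand

    def add(self, word, vector):
        # Appends a row but leaves the cache untouched (the assumed flaw).
        self.vocab[word] = len(self.vectors)
        self.vectors.append(list(vector))

    def init_sims(self):
        # Build the cache only if it does not exist yet.
        if self.vectors_norm is None:
            self.vectors_norm = [
                [x / math.sqrt(sum(y * y for y in row)) for x in row]
                for row in self.vectors
            ]

    def most_similar(self, word):
        self.init_sims()
        # Raises IndexError when the cache has fewer rows than the vocab.
        query = self.vectors_norm[self.vocab[word]]
        scores = [
            (other, sum(a * b for a, b in zip(query, self.vectors_norm[idx])))
            for other, idx in self.vocab.items()
            if other != word
        ]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)


kv = ToyKeyedVectors()
kv.add("thee", [1.0, 0.0])
kv.add("melk", [0.0, 1.0])
kv.most_similar("thee")        # first call builds a 2-row cache

kv.add("koffie", [0.9, 0.1])   # vocab now has 3 words, cache still has 2 rows
try:
    kv.most_similar("koffie")
    stale_error = None
except IndexError as exc:
    stale_error = exc          # same kind of failure as in the question

kv.vectors_norm = None         # the workaround: drop the stale cache
neighbours = kv.most_similar("koffie")  # cache is rebuilt with 3 rows
```

In this toy version, the lookup fails for the same reason as in the traceback: the new word's index points one row past the end of the cached array until the cache is cleared and rebuilt.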

(This shouldn't be necessary in the next 4.0.0 release of gensim, but I'll double-check there's not a similar problem there.)

Separately:

  • Your 'word_vector' is not defined error is simply because you seem to have left the 's' off your chosen variable name word_vectors

  • You probably don't need gensim's testing utility method get_tmpfile() - just use your own explicit, intentional filesystem paths for saving and loading

  • Whether it's proper to use KeyedVectors.load() depends on what was saved. If you are in fact saving a full Word2Vec class instance (more than just the vectors), using Word2Vec.load() would be more appropriate.

Upvotes: 1
