kerschi

Reputation: 137

Gensim framework: Saving and storing word2vec keyed vectors

I'm using the word2vec model and I have a problem with saving and loading it.

import gensim.models.keyedvectors as w2v
from gensim.models import KeyedVectors

word_vectors = w2v.wv
word_vectors.save(filepath + "Vectors.bin")

m = word2vec.KeyedVectors.load_word2vec_format(filepath + "Vectors.bin", binary=True)

I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Loading does work in the following way:

vectors = KeyedVectors.load(filepath + "Vectors.bin", mmap='r')

But if I then call

vectors.similar_by_word("cat")

I get the following error: TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

What am I doing wrong? How can I use the save_word2vec_format() function?

Upvotes: 0

Views: 4486

Answers (1)

gojomo

Reputation: 54183

Your initial block of code that does import gensim.models.keyedvectors as w2v and word_vectors = w2v.wv doesn't make much sense.

Read literally, that is assigning the full Python code module gensim.models.keyedvectors to the variable w2v. That module isn't going to have a .wv property, so I wouldn't expect word_vectors = w2v.wv to even execute. It certainly wouldn't result in word_vectors then being an actual set of trained-up word-vectors, unless there's a bunch of other training code you're not showing.

Are you sure the code in your question is representative of what you're actually doing?

Nevertheless, if you did succeed in getting word_vectors to hold one of gensim's KeyedVectors objects, filled with word-vectors you want to save, you then have two choices:

  • To save the word-vectors in the same format as was used by Google's original word2vec.c release, you can use the .save_word2vec_format(path, ...) method. Then, to later reload those vectors, you'd use the matched reloaded_vectors = KeyedVectors.load_word2vec_format(path, ...) method.

  • To save the word-vectors in gensim's own Python-based format, you can use the .save(path) method. Then, to later reload those vectors, you'd use the matched reloaded_vectors = KeyedVectors.load(path) method. This approach may save a little more info (if it's present from your training), such as word-counts. For efficiency with larger objects, it may store the bulk of the vectors in a separate file. Keep that file alongside the main path file if you move the files elsewhere. The separate file also allows the option (but not the requirement) of using mmap options later.

You can't mix and match these formats: a file saved by save_word2vec_format() can only be read by load_word2vec_format(), and a file saved by save() can only be read by load().
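
This mismatch is a plausible source of the UnicodeDecodeError in the question. save() writes a Python pickle, and pickles written with protocol 2 or later begin with the byte 0x80, while load_word2vec_format() begins by reading a text header line. The standard-library sketch below only imitates that situation; the dictionary is a hypothetical stand-in for a saved object, and no gensim code is involved.

```python
import pickle

# Hypothetical stand-in for whatever object save() would pickle.
payload = pickle.dumps({"cat": [1.0, 0.0]}, protocol=4)

# Pickle protocol 2+ starts with the PROTO opcode, byte 0x80.
print(hex(payload[0]))  # 0x80

# Treating those bytes as a UTF-8 text header fails on the first byte.
try:
    payload.decode("utf-8")
except UnicodeDecodeError as err:
    message = str(err)
    print(message)
```

The printed message names byte 0x80 at position 0, matching the error quoted in the question.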

Regarding your other TypeError, there's not enough info to speculate what's gone wrong. You'd need to edit your question to add more details, and make the demonstrating code self-consistent.

For example, you show loading into a variable named vectors, but then an operation on a variable called model. This discrepancy hints the problem might be some other mismatch in your un-shown code.

Similarly, if you encounter any error, you should quote the exact error message and the full error stack in your question, so answerers can see exactly which lines of code, in your code and in the libraries you are relying on, are involved in the error. (This usually helps pinpoint the place where your expectations/code deviate from the library's requirements.)

Upvotes: 2
