Reputation: 137
I am using the word2vec model and I have a problem with saving and loading it.
import gensim.models.keyedvectors as w2v
from gensim.models import KeyedVectors
word_vectors = w2v.wv
word_vectors.save(filepath + "Vectors.bin")
m = word2vec.KeyedVectors.load_word2vec_format(filepath + "Vectors.bin", binary=True)
I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
Loading does work if I do it this way:
vectors = KeyedVectors.load(filepath + "Vectors.bin", mmap='r')
But if I then call
vectors.similar_by_word("cat")
I get the following error: TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
What am I doing wrong? How can I use the save_word2vec_format() function?
Upvotes: 0
Views: 4486
Reputation: 54183
Your initial block of code that does import gensim.models.keyedvectors as w2v and word_vectors = w2v.wv doesn't make much sense.
Read literally, that is assigning the full Python code module gensim.models.keyedvectors to the variable w2v. That module isn't going to have a .wv property, so I wouldn't expect word_vectors = w2v.wv to even execute. It certainly wouldn't result in word_vectors then being an actual set of trained-up word-vectors, unless there's a bunch of other training code you're not showing.
Are you sure the code in your question is representative of what you're actually doing?
Nevertheless, if you did succeed in getting word_vectors to hold one of gensim's KeyedVectors objects, filled with word-vectors you want to save, you then have two choices:
To save the word-vectors in the same format as was used by Google's original word2vec.c release, you can use the .save_word2vec_format(path, ...) method. Then, to later reload those vectors, you'd use the matched reloaded_vectors = KeyedVectors.load_word2vec_format(path, ...) method.
To save the word-vectors in gensim's own Python-based format, you can use the .save(path) method. Then, to later reload those vectors, you'd use the matched reloaded_vectors = KeyedVectors.load(path) method. This approach may save a little more info (if it's present from your training), such as word-counts. For efficiency with larger objects, it may store the bulk of the vectors in a separate file, which should be kept alongside the main path file if you move the files elsewhere, and allows the option (but not the requirement) of using mmap options later.
You can't mix and match these formats: a file saved by save_word2vec_format() can only be read by load_word2vec_format(), and a file saved by save() can only be read by load().
Regarding your other TypeError, there's not enough info to speculate about what's gone wrong. You'd need to edit your question to add more details, and make the demonstrating code self-consistent.
For example, you show loading into a variable named vectors, but then an operation on a variable called model. This discrepancy hints that the problem might be some other mismatch in your un-shown code.
Similarly, whenever you encounter an error, you should quote the exact error message and the full stack trace in your question, so answerers can see which lines, in your code and in the libraries you rely on, are involved in the error. (This usually helps pinpoint where your expectations or code deviate from the library's requirements.)
Upvotes: 2