Reputation: 429
I'm writing my first app in python to use word2vec model. Here is my simple code
import gensim, logging
import sys
import warnings
from gensim.models import Word2Vec
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
def main():
####LOAD MODEL
model = Word2Vec.load_word2vec_format('models/vec-cbow.txt', binary=False)
model.similarity('man', 'women')
if __name__ == '__main__':
with warnings.catch_warnings():
warnings.simplefilter("error")
#warnings.simplefilter("ignore")
main()
I getting this the following error:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 96-97: invalid continuation byte
I tried solving it by adding these two lines, but I'm still getting the error.
reload(sys) # Reload does the trick!
sys.setdefaultencoding('UTF8') #UTF8 #latin-1
The w2v model was trained on English sentences.
EDIT: Here is the full stack:
**%run "...\getSimilarity.py"**
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
**...\getSimilarity.py in <module>()**
64 warnings.simplefilter("error")
65 #warnings.simplefilter("ignore")
---> 66 main()
**...\getSimilarity.py in main()**
30 ####LOAD MODEL
---> 31 model = Word2Vec.load_word2vec_format('models/vec-cbow.txt', binary=False) # C binary format
32 model.similarity('man', 'women')
**...\AppData\Local\Enthought\Canopy\User\lib\site-packages\gensim-0.12.4-py2.7-win-amd64.egg\gensim\models\word2vec.pyc in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors)**
1090 else:
1091 for line_no, line in enumerate(fin):
-> 1092 parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")
1093 if len(parts) != vector_size + 1:
1094 raise ValueError("invalid vector on line %s (is this really the text format?)" % (line_no))
**...\AppData\Local\Enthought\Canopy\User\lib\site-packages\gensim-0.12.4-py2.7-win-amd64.egg\gensim\utils.pyc in any2unicode(text, encoding, errors)**
215 if isinstance(text, unicode):
216 return text
--> 217 return unicode(text, encoding, errors=errors)
218 to_unicode = any2unicode
219
**...\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.6.2.3262.win-x86_64\lib\encodings\utf_8.pyc in decode(input, errors)**
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
**UnicodeDecodeError: 'utf8' codec can't decode bytes in position 96-97: invalid continuation byte**
Any hints how to solve the problem? Thanks in advance.
Upvotes: 1
Views: 3568
Reputation: 1305
From the gensim FAQ you can that option about setting unicode_errors as 'ignore' or 'replace', which seems to work in some occasions but not all.
But if you look at the specific help of the function, there is also this:
binary is a boolean indicating whether the data is in binary word2vec format
This is beause the word2vec model is saved as binary and not as any encoded string. Therefore, just setting binary = True
should work in all these cases.
For example, if you are trying to use the google pre-trained model from here, this should work:
google_model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary = True)
Hope this helps!
Upvotes: 2
Reputation: 583
The fix is on your side and it is to either:
a) Store your model using a program that understands unicode and utf8 (such as gensim). Some C and Java word2vec tools are known to truncate the strings at byte boundaries, which can result in cutting a multi-byte utf8 character in half, making it non-valid utf8, leading to this error.
b) Set the unicode_errors flag when running load_word2vec_model, e.g. load_word2vec_model(..., unicode_errors='ignore'). Note that this silences the error, but the utf8 problem is still there -- invalid utf8 characters will just be ignored in this case.
Reason:
The strings (words) stored in your model are not valid utf8. By default, gensim decodes the words using the strict encoding settings, which results in the above exception whenever an invalid utf8 sequence is encountered.
--picked up from gensim's FAQ
Upvotes: 0
Reputation: 429
I found the solution simply by reading this FAQ page. "The strings (words) stored in your model are not valid utf8. By default, gensim decodes the words using the strict encoding settings, which results in the above exception whenever an invalid utf8 sequence is encountered."
Upvotes: 0