sareem

Reputation: 429

Encoding issue in python while using w2v

I'm writing my first Python app that uses a word2vec model. Here is my simple code:

import gensim, logging
import sys
import warnings
from gensim.models import Word2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

def main(): 
    ####LOAD MODEL
    model = Word2Vec.load_word2vec_format('models/vec-cbow.txt', binary=False)  
    model.similarity('man', 'women')

if __name__ == '__main__':
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        #warnings.simplefilter("ignore")
    main()

I'm getting the following error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 96-97: invalid continuation byte 

I tried solving it by adding these two lines, but I'm still getting the error.

reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8') #UTF8 #latin-1

The w2v model was trained on English sentences.

EDIT: Here is the full stack trace:

**%run "...\getSimilarity.py"**
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
**...\getSimilarity.py in <module>()**
     64         warnings.simplefilter("error")
     65         #warnings.simplefilter("ignore")
---> 66     main()

**...\getSimilarity.py in main()**
     30     ####LOAD MODEL
---> 31     model = Word2Vec.load_word2vec_format('models/vec-cbow.txt', binary=False)  # C binary format
     32     model.similarity('man', 'women')

**...\AppData\Local\Enthought\Canopy\User\lib\site-packages\gensim-0.12.4-py2.7-win-amd64.egg\gensim\models\word2vec.pyc in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors)**
   1090             else:
   1091                 for line_no, line in enumerate(fin):
-> 1092                     parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")
   1093                     if len(parts) != vector_size + 1:
   1094                         raise ValueError("invalid vector on line %s (is this really the text format?)" % (line_no))

**...\AppData\Local\Enthought\Canopy\User\lib\site-packages\gensim-0.12.4-py2.7-win-amd64.egg\gensim\utils.pyc in any2unicode(text, encoding, errors)**
    215     if isinstance(text, unicode):
    216         return text
--> 217     return unicode(text, encoding, errors=errors)
    218 to_unicode = any2unicode
    219 

**...\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.6.2.3262.win-x86_64\lib\encodings\utf_8.pyc in decode(input, errors)**
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

**UnicodeDecodeError: 'utf8' codec can't decode bytes in position 96-97: invalid continuation byte** 

Any hints on how to solve the problem? Thanks in advance.

Upvotes: 1

Views: 3568

Answers (3)

TitoOrt

Reputation: 1305

The gensim FAQ mentions the option of setting unicode_errors to 'ignore' or 'replace', which seems to work on some occasions but not all.

But if you look at the specific help of the function, there is also this:

binary is a boolean indicating whether the data is in binary word2vec format

This is because a word2vec model saved in binary format stores its vectors as raw bytes, not as encoded text, so reading it as text fails. If your file is in that format, setting binary=True should fix the error.

For example, if you are trying to use the Google pre-trained model, this should work:

google_model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary = True)
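
If you are not sure which format a file is in, a rough check like the one below may help before choosing the binary flag. This is only a heuristic sketch: the helper name looks_like_text_word2vec and the probe_lines parameter are made up for illustration and are not part of gensim. It assumes the usual layout of a header line "vocab_size vector_size" followed by one "word v1 v2 ..." line per word.

```python
def looks_like_text_word2vec(path, probe_lines=5):
    """Heuristically check whether a word2vec file is in text format.

    Hypothetical helper, not part of gensim.
    """
    with open(path, "rb") as fin:
        # Both formats start with a text header: "<vocab_size> <vector_size>"
        header = fin.readline().split()
        if len(header) != 2 or not all(tok.isdigit() for tok in header):
            return False
        vector_size = int(header[1])

        for _ in range(probe_lines):
            line = fin.readline()
            if not line:
                break  # fewer lines than probe_lines, nothing more to check
            parts = line.rstrip().split(b" ")
            # A text line is the word followed by exactly vector_size numbers
            if len(parts) != vector_size + 1:
                return False
            try:
                for value in parts[1:]:
                    float(value.decode("ascii"))
            except ValueError:  # also covers UnicodeDecodeError
                return False
    return True
```

If it returns False for a file that has a valid header, the file is more likely in binary format and binary=True is worth trying. The word itself is never decoded here, so a file with broken utf8 in its words can still be recognised as text.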

Hope this helps!

Upvotes: 2

aneesh joshi

Reputation: 583

The fix is on your side and it is to either:

a) Store your model using a program that understands unicode and utf8 (such as gensim). Some C and Java word2vec tools are known to truncate the strings at byte boundaries, which can result in cutting a multi-byte utf8 character in half, making it non-valid utf8, leading to this error.

b) Set the unicode_errors flag when running load_word2vec_format, e.g. load_word2vec_format(..., unicode_errors='ignore'). Note that this silences the error, but the utf8 problem is still there -- invalid utf8 characters will just be ignored in this case.

Reason:

The strings (words) stored in your model are not valid utf8. By default, gensim decodes the words using the strict encoding settings, which results in the above exception whenever an invalid utf8 sequence is encountered.
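
To see the effect without gensim, here is a small sketch using plain Python decoding. The byte string is an invented example, not taken from the asker's file: it imitates a word whose two-byte "é" (0xC3 0xA9) was cut after the first byte, so 0xC3 is followed directly by the space separator.

```python
# Invented example line: "café" truncated in the middle of "é"
raw_line = b"caf\xc3 0.12 0.34"


def decode_line(raw, errors="strict"):
    """Decode one line of a text-format vector file as utf8."""
    return raw.decode("utf8", errors=errors)


# Strict decoding (the default) fails on the truncated character
try:
    decode_line(raw_line)
    strict_failed = False
    strict_reason = None
except UnicodeDecodeError as exc:
    strict_failed = True
    strict_reason = exc.reason

# 'ignore' silently drops the bad byte, 'replace' substitutes U+FFFD
ignored = decode_line(raw_line, errors="ignore")
replaced = decode_line(raw_line, errors="replace")

print(strict_failed, strict_reason)
print(repr(ignored))
print(repr(replaced))
```

Either setting lets the load continue, but the stored word is no longer the original one, which is the caveat mentioned in option b).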

--picked up from gensim's FAQ

Upvotes: 0

sareem

Reputation: 429

I found the solution simply by reading the gensim FAQ page: "The strings (words) stored in your model are not valid utf8. By default, gensim decodes the words using the strict encoding settings, which results in the above exception whenever an invalid utf8 sequence is encountered."
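
For anyone landing here, a minimal sketch of what applying the FAQ's unicode_errors suggestion to the load call from the question might look like. The wrapper name load_text_vectors is made up for illustration. It assumes a gensim version whose load_word2vec_format accepts unicode_errors, as the signature in the traceback above shows, and that newer releases (around 1.0 and later) expose the loader on KeyedVectors instead of Word2Vec.

```python
def load_text_vectors(path, unicode_errors="ignore"):
    """Load text-format word2vec vectors, tolerating invalid utf8 in words.

    Hypothetical wrapper around gensim's load_word2vec_format.
    """
    try:
        # Newer gensim releases provide the loader on KeyedVectors
        from gensim.models import KeyedVectors as loader
    except ImportError:
        # Older releases, such as the 0.12.4 in the traceback, use Word2Vec
        from gensim.models import Word2Vec as loader
    return loader.load_word2vec_format(
        path, binary=False, unicode_errors=unicode_errors
    )
```

Usage would be something like model = load_text_vectors('models/vec-cbow.txt') followed by model.similarity('man', 'women'). Words that contained invalid bytes are loaded with those bytes dropped, so they may not match the spelling you expect.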

Upvotes: 0
