Leslie LIU

Reputation: 1

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 0: unexpected end of data

I am using FastText.load_fasttext_format() to load the official fastText pre-trained Japanese model (300 dimensions) in Google Colab.

Here is my code.

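# Import assumed; the first line of the original cell is not shown in the traceback.
from gensim.models.fasttext import FastText

model_path = "/content/drive/MyDrive/IDR/rakuten/wikipedia_fastText/cc.ja.300.bin"
model = FastText.load_fasttext_format(model_path)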

And here is the encoding error.

---------------------------------------------------------------------------

UnicodeDecodeError                        Traceback (most recent call last)

<ipython-input-7-61d7c85f09b2> in <module>()
      2 
      3 model_path = "/content/drive/MyDrive/IDR/rakuten/wikipedia_fastText/cc.ja.300.bin"
----> 4 model = FastText.load_fasttext_format(model_path)

2 frames

/usr/local/lib/python3.7/dist-packages/gensim/models/fasttext.py in _load_dict(self, file_handle, encoding)
    818                 word_bytes += char_byte
    819                 char_byte = file_handle.read(1)
--> 820             word = word_bytes.decode(encoding)
    821             count, _ = self.struct_unpack(file_handle, '@qb')
    822 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 0: unexpected end of data

Upvotes: 0

Views: 567

Answers (1)

gojomo

Reputation: 54233

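The key part of the error seems to be "unexpected end of data": the decoder ran out of bytes partway through a multi-byte UTF-8 character (0xe3 is the lead byte of a three-byte sequence).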

Are you sure the cc.ja.300.bin file you downloaded is complete (not truncated) and uncorrupted, with a size and contents that match any checksum declared by the source you downloaded it from?
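One way to check this is to compute the file's size and a hash locally and compare them with whatever the download source reports. The following is a minimal sketch, not an official procedure: the helper name `file_size_and_md5` is hypothetical, and the use of MD5 is an assumption (use whichever hash algorithm the source actually publishes, if it publishes one at all).

```python
import hashlib
import os


def file_size_and_md5(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex) for the file at `path`.

    Hypothetical helper: the file is read in chunks so that a
    multi-gigabyte model file does not have to fit in memory.
    """
    digest = hashlib.md5()  # assumption: swap in the algorithm your source publishes
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


# Usage (path taken from the question):
# size, md5_hex = file_size_and_md5(
#     "/content/drive/MyDrive/IDR/rakuten/wikipedia_fastText/cc.ja.300.bin"
# )
# print(size, md5_hex)
```

If the size is smaller than the size shown by the download source, or the hash differs from a published one, re-downloading the file is the first thing to try.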

Separately, the load_fasttext_format() class method is deprecated in current versions of Gensim; load_facebook_model() is now the preferred way to load these files (though this wouldn't account for your error).
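For reference, a call to the preferred loader might look like the sketch below. The wrapper name `load_facebook_fasttext` and its up-front path check are illustrative assumptions, not part of Gensim; only `load_facebook_model` from `gensim.models.fasttext` is the library's own function.

```python
import os


def load_facebook_fasttext(path):
    """Load a native fastText .bin model via Gensim's load_facebook_model().

    Hypothetical wrapper: it fails early with a clear error if the file
    is missing, before Gensim tries to parse anything.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    # Imported here so the path check above runs even if Gensim is not installed.
    from gensim.models.fasttext import load_facebook_model
    return load_facebook_model(path)


# Usage (path taken from the question):
# model = load_facebook_fasttext(
#     "/content/drive/MyDrive/IDR/rakuten/wikipedia_fastText/cc.ja.300.bin"
# )
# vector = model.wv["日本"]  # look up a word vector through the model's KeyedVectors
```

Switching loaders will not repair a truncated or corrupted file, so the file check should come first.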

Upvotes: 1
