Reputation: 91

UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 0: illegal multibyte sequence

I am using Python 3.4 on 64-bit Windows 7. I ran the following code:

      6   """ load single batch of cifar """
      7   with open(filename, 'r') as f:
----> 8     datadict = pickle.load(f)
      9     X = datadict['data']

The error message is UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 0: illegal multibyte sequence

I changed line 7 to:

      6   """ load single batch of cifar """
      7   with open(filename, 'r', encoding='utf-8') as f:
----> 8     datadict = pickle.load(f)
      9     X = datadict['data']

The error message became UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte.

The traceback ends in Python34\lib\codecs.py, in decode(self, input, final):

    311         # decode input (taking the buffer into account)
    312         data = self.buffer + input
--> 313         (result, consumed) = self._buffer_decode(data, self.errors, final)
    314         # keep undecoded input until the next call
    315         self.buffer = data[consumed:]

I then changed the code to:

      6   """ load single batch of cifar """
      7   with open(filename, 'rb') as f:
----> 8     datadict = pickle.load(f)
      9     X = datadict['data']
     10     Y = datadict['labels']

This time the error is UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 6: ordinal not in range(128).

What is the problem, and how do I solve it?

Upvotes: 9

Answers (3)

Bruce

Reputation: 2196

If you are using Python 3.7+, you can set an environment variable to solve this.

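export PYTHONUTF8=1  # Linux / macOS

On Windows (cmd.exe), leave off the trailing comment, because everything after the = becomes part of the value:

set PYTHONUTF8=1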

More info:
https://dev.to/methane/python-use-utf-8-mode-on-windows-212i
https://stackoverflow.com/a/50933341/1745885
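
To check whether the variable took effect, a small sketch like the following may help. It assumes Python 3.7 or later (where sys.flags.utf8_mode exists), and describe_text_defaults is a hypothetical helper name. Note that this setting changes the default encoding used by text-mode open(); it does not change how bytes read in 'rb' mode are handled.

```python
import locale
import sys


def describe_text_defaults():
    """Report the settings that PYTHONUTF8 influences (hypothetical helper)."""
    return {
        # 1 when UTF-8 mode is active, 0 otherwise.
        "utf8_mode": sys.flags.utf8_mode,
        # Default encoding used by open() in text mode when none is given.
        "preferred_encoding": locale.getpreferredencoding(False),
    }


print(describe_text_defaults())
```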

Upvotes: 5

varuscn

Reputation: 161

If you want to open the file as UTF-8 text, write:

open(file_name, 'r', encoding='UTF-8')

If the file is GBK-encoded, open it in binary mode instead:

open(file_name, 'rb')

Hope this solves your problem!

Upvotes: 11

Martijn Pieters

Reputation: 1123620

Pickle files are binary data files, so you always have to open the file with the 'rb' mode when loading. Don't try to use a text mode here.
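
The "byte 0x80 in position 0" in both text-mode errors is consistent with this: pickles written with protocol 2 or later begin with the PROTO opcode, byte 0x80, which is not a valid start byte in UTF-8. A minimal sketch, using a small in-memory pickle as a stand-in for the CIFAR file (the sample dictionary is made up for illustration):

```python
import pickle

# Hypothetical sample data standing in for a CIFAR batch.
payload = pickle.dumps({"data": [1, 2, 3]}, protocol=2)

# Protocol 2+ pickles begin with the PROTO opcode, byte 0x80.
first_byte = payload[0]
print(hex(first_byte))

# Decoding the raw bytes as text is what a text-mode open() attempts.
try:
    payload.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(exc)

# Treating the same bytes as binary works.
restored = pickle.loads(payload)
print(restored)
```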

You are trying to load a Python 2 pickle that contains string data. You'll have to tell pickle.load() how to convert that data to Python 3 strings, or to leave them as bytes.

The default is to try and decode those strings as ASCII, and that decoding fails. See the pickle.load() documentation:

Optional keyword arguments are fix_imports, encoding and errors, which are used to control compatibility support for pickle stream generated by Python 2. If fix_imports is true, pickle will try to map the old Python 2 names to the new names used in Python 3. The encoding and errors tell pickle how to decode 8-bit string instances pickled by Python 2; these default to ‘ASCII’ and ‘strict’, respectively. The encoding can be ‘bytes’ to read these 8-bit string instances as bytes objects.

Setting the encoding to latin1 allows you to import the data directly:

with open(filename, 'rb') as f:
    datadict = pickle.load(f, encoding='latin1')
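
The difference between the default and latin1 can be reproduced without the CIFAR file. The bytes below are a hand-written protocol 2 pickle of the kind Python 2 produces for a dict holding an 8-bit str; the payload is an illustrative assumption, not the actual dataset:

```python
import pickle

# Hand-built Python 2 style pickle of {'data': '\x8b\x01'} (illustrative only).
py2_pickle = b"\x80\x02}q\x00U\x04dataq\x01U\x02\x8b\x01q\x02s."

# The default encoding is ASCII, so the non-ASCII byte 0x8b raises.
try:
    pickle.loads(py2_pickle)
    default_failed = False
except UnicodeDecodeError as exc:
    default_failed = True
    print(exc)

# latin1 maps every byte value 0-255 to a code point, so decoding cannot fail.
loaded = pickle.loads(py2_pickle, encoding="latin1")
print(loaded)
```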

It appears that it is the numpy array data that is causing the problems here as all strings in the set use ASCII characters only.

The alternative would be to use encoding='bytes', but then all the filenames and top-level dictionary keys are bytes objects, and you'd have to decode those or prefix all your key literals with b.
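
A rough sketch of that alternative, again using a hand-written Python 2 style pickle rather than the real CIFAR batch. The decode_keys helper is hypothetical and only handles top-level keys, on the assumption that they are plain ASCII:

```python
import pickle

# Hand-built Python 2 style pickle of {'data': '\x8b\x01'} (illustrative only).
py2_pickle = b"\x80\x02}q\x00U\x04dataq\x01U\x02\x8b\x01q\x02s."

# With encoding='bytes', 8-bit strings come back as bytes objects, keys included.
raw = pickle.loads(py2_pickle, encoding="bytes")


def decode_keys(d):
    """Decode top-level bytes keys to str; values are left untouched."""
    return {
        (k.decode("ascii") if isinstance(k, bytes) else k): v
        for k, v in d.items()
    }


converted = decode_keys(raw)
print(raw[b"data"], converted["data"])
```

Without the helper, lookups need bytes literals such as raw[b"data"].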

Upvotes: 16
