user8597915

UnicodeDecodeError in reading a file

When I read the whole file at once, my script works without a problem:

fst = 0
with open(in_ckfile, 'rb', 0) as file:
    with open(outfile_namepath, mode='wb') as outfile:
        while True:
            #buf = file.read(204800)
            buf = file.read()
                    
            if buf: 
                fst += 1
                print('read no., len of buf ......: ', fst, len(buf))

                buf = buf.decode()
                xbytes = bytearray()
                xbytes.extend(map(ord, buf))  
                buf = xbytes

                print('read no., len of decode buf: ', fst, len(buf))

The output is:

read no., len of buf ......:  1 26848013
read no., len of decode buf:  1 18546777
len of in string ..........:  18546777
len of output str, checked :  18546777 370130 

However, when I read the file in chunks with buf = file.read(204800), I get an error on the third read:

read no., len of buf ......:  1 204800
read no., len of decode buf:  1 141406
len of in string ..........:  141406
len of output str, checked :  141406 2827 

read no., len of buf ......:  2 204800
read no., len of decode buf:  2 141606
len of in string ..........:  141606
len of output str, checked :  141606 2800 

read no., len of buf ......:  3 204800
Traceback (most recent call last):
  File "<pyshell#155>", line 1, in <module>
  ...
  buf = buf.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 204799: unexpected end of data

How do I fix this?

Upvotes: 0

Views: 96

Answers (1)

Jiří Baum

Reputation: 6940

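In UTF-8, many characters are encoded as multi-byte sequences. When you read blocks with a fixed number of bytes, you will sometimes end up with the beginning of a sequence in one block and the remainder in the next one. That is what happened in the error you posted: byte 0xc3 at position 204799 is the last byte of the block and the first byte of a two-byte sequence, and its second byte is in the next block.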
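A small sketch that reproduces the same failure on a tiny input may make this concrete. The string "café" and the split position are made up for illustration; they are not taken from your file.

```python
# "é" is U+00E9, which UTF-8 encodes as the two bytes 0xC3 0xA9.
data = "café".encode("utf-8")    # 5 bytes: c, a, f, 0xC3, 0xA9

first_block = data[:4]           # ends with the lead byte 0xC3 only
second_block = data[4:]          # holds the continuation byte 0xA9

try:
    first_block.decode("utf-8")
    error_reason = None
except UnicodeDecodeError as exc:
    # Same kind of failure as in the traceback: the block ends mid-sequence.
    error_reason = exc.reason
    print(exc)

# Joined back together, the same bytes decode without error.
whole_text = (first_block + second_block).decode("utf-8")
```

Neither block is corrupt on its own; the error only comes from decoding each block in isolation.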

There are two ways to solve it:

  • Use one of the built-in ways to handle it, e.g. open the file as a UTF-8-encoded text file, or use an incremental (stream) decoder, and let the standard library deal with the split sequences. This is usually the better approach.
  • If you need to handle it manually: for every block except the last, check the end of the block, remove any incomplete multi-byte sequence (or simply any trailing multi-byte sequence, complete or not, which is easier to detect), and put it at the beginning of the next block.
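Both options can be sketched roughly as follows. The function names decode_in_chunks and split_incomplete_tail are hypothetical helpers written for this answer, and the in-memory sample stands in for your file; codecs.getincrementaldecoder is the standard-library piece doing the real work in the first option. Treat this as a minimal sketch, assuming the input is valid UTF-8.

```python
import codecs
import io


def decode_in_chunks(binary_file, chunk_size=204800, encoding="utf-8"):
    """Option 1: yield decoded text per chunk using an incremental decoder."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = binary_file.read(chunk_size)
        if not chunk:
            # final=True raises if the file itself ends mid-sequence.
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail
            break
        # The decoder keeps any incomplete trailing bytes for the next call.
        text = decoder.decode(chunk)
        if text:
            yield text


def split_incomplete_tail(block):
    """Option 2: return (complete, leftover) with `complete` ending on a
    character boundary; prepend `leftover` to the next block."""
    # A UTF-8 sequence is at most 4 bytes, so look back at most 4 bytes.
    for back in range(1, min(4, len(block)) + 1):
        byte = block[-back]
        if byte & 0xC0 == 0x80:
            continue                 # continuation byte, keep looking back
        if byte >= 0xF0:
            needed = 4               # lead byte of a 4-byte sequence
        elif byte >= 0xE0:
            needed = 3               # lead byte of a 3-byte sequence
        elif byte >= 0xC0:
            needed = 2               # lead byte of a 2-byte sequence
        else:
            needed = 1               # plain ASCII byte
        if needed > back:
            return block[:-back], block[-back:]
        return block, b""
    return block, b""


# Tiny in-memory stand-in for the real file; chunk_size=3 forces splits.
sample = ("é" * 10).encode("utf-8")
pieces = list(decode_in_chunks(io.BytesIO(sample), chunk_size=3))

complete, leftover = split_incomplete_tail(b"abc\xc3")
```

If you only need the text, opening the file with open(path, encoding='utf-8') and calling read(n) is simpler still; note that n then counts characters rather than bytes.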

Upvotes: 2
