Reputation:
When I read the whole file in a single call, my script works without any problem:
fst = 0
with open(in_ckfile, 'rb', 0) as file:
    with open(outfile_namepath, mode='wb') as outfile:
        while True:
            #buf = file.read(204800)
            buf = file.read()
            if buf:
                fst += 1
                print('read no., len of buf ......: ', fst, len(buf))
                buf = buf.decode()
                xbytes = bytearray()
                xbytes.extend(map(ord, buf))
                buf = xbytes
                print('read no., len of decode buf: ', fst, len(buf))
The output of that run is shown below:
read no., len of buf ......: 1 26848013
read no., len of decode buf: 1 18546777
len of in string ..........: 18546777
len of output str, checked : 18546777 370130
However, when I read the file in fixed-size chunks with buf = file.read(204800), the third read fails with an error:
read no., len of buf ......: 1 204800
read no., len of decode buf: 1 141406
len of in string ..........: 141406
len of output str, checked : 141406 2827
read no., len of buf ......: 2 204800
read no., len of decode buf: 2 141606
len of in string ..........: 141606
len of output str, checked : 141606 2800
read no., len of buf ......: 3 204800
Traceback (most recent call last):
File "<pyshell#155>", line 1, in <module>
...
buf = buf.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 204799: unexpected end of data
How do I fix this?
Upvotes: 0
Views: 96
Reputation: 6940
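In UTF-8, many characters are encoded as multi-byte sequences. When you read blocks with a fixed number of bytes, you will sometimes end up with the beginning of a sequence at the end of one block and the remainder at the start of the next. That is exactly what the error you posted shows: byte 0xc3 at position 204799 (the last byte of the block) is the lead byte of a two-byte sequence, and its continuation byte is in the next block.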
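To make the failure concrete, here is a small sketch that reproduces it without any file. The sample string and the split point are made up for illustration; the only thing taken from the question is the error itself.

```python
# "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9.
data = "café".encode("utf-8")  # 5 bytes: c, a, f, 0xC3, 0xA9

# Simulate a fixed-size read that cuts the two-byte sequence in half.
first_block, second_block = data[:4], data[4:]

try:
    first_block.decode("utf-8")
    error_reason = None
except UnicodeDecodeError as exc:
    # Same kind of failure as in the question: a lead byte with nothing after it.
    error_reason = exc.reason
    print(exc)

# Decoding the two blocks joined back together works fine.
whole_text = (first_block + second_block).decode("utf-8")
```

The block boundary, not the data, is the problem: the same bytes decode cleanly once the two halves of the sequence are together.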
How to solve it - two options:
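One possible approach, sketched here and not necessarily one of the options meant above, is an incremental decoder from the standard `codecs` module. It keeps an incomplete trailing sequence in its internal state and completes it when the next block arrives. The function name `decode_in_chunks` and its parameters are hypothetical and chosen only for this example.

```python
import codecs
import io


def decode_in_chunks(stream, chunk_size=204800, encoding="utf-8"):
    """Yield decoded text from a binary stream read in fixed-size blocks.

    Hypothetical helper for illustration; `stream` is any object with a
    binary read(n) method, such as a file opened with mode 'rb'.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        buf = stream.read(chunk_size)
        if not buf:
            # final=True raises UnicodeDecodeError if the stream ended
            # in the middle of a multi-byte sequence.
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail
            break
        # Bytes of an incomplete trailing sequence are held back by the
        # decoder and prepended to the next block automatically.
        text = decoder.decode(buf)
        if text:
            yield text


# Usage with an in-memory stream; a 3-byte block size forces splits
# inside the two-byte sequences of "é".
sample = "é" * 5
pieces = list(decode_in_chunks(io.BytesIO(sample.encode("utf-8")), chunk_size=3))
print("".join(pieces) == sample)
```

If the byte counts of each block do not matter, another standard alternative is to open the file in text mode, for example `open(in_ckfile, encoding='utf-8')`, in which case `read(n)` counts characters instead of bytes and never splits one.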
Upvotes: 2