Reputation: 5931
I am following example code in Programming Python, and something confuses me. Here's the code that writes a simple string to a file and then reads it back:
>>> data = 'sp\xe4m' # data to your script
>>> data, len(data) # 4 unicode chars, 1 nonascii
('späm', 4)
>>> data.encode('utf8'), len(data.encode('utf8')) # bytes written to file
(b'sp\xc3\xa4m', 5)
>>> f = open('test', mode='w+', encoding='utf8') # use text mode, encoded
>>> f.write(data)
>>> f.flush()
>>> f.seek(0); f.read(1) # ascii bytes work
's'
>>> f.seek(2); f.read(1) # as does 2-byte nonascii
'ä'
>>> data[3] # but offset 3 is not 'm' !
'm'
>>> f.seek(3); f.read(1)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa4 in position 0:
unexpected code byte
What confuses me is this: why does this UnicodeDecodeError occur if the data string is UTF-8 encoded? Reading with a plain f.read() works fine, but when I use seek to jump to an offset and then call read(1), this error shows up.
Upvotes: 0
Views: 870
Reputation: 1121814
Seeking within a file moves the read pointer by bytes, not by characters. The .read() call, on the other hand, expects to be able to read whole characters. Because UTF-8 uses multiple bytes for any Unicode codepoint beyond the ASCII character set, you cannot seek into the middle of a multi-byte UTF-8 sequence and expect .read() to work.
The U+00E4 codepoint (the glyph ä) is encoded to two bytes, C3 and A4. In the file, this means there are now 5 bytes: s, p, the hex bytes C3 and A4, then m.
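To make the byte layout concrete, here is a minimal sketch that works on the encoded bytes directly rather than on a file. The variable names (layout, error, tail) are only illustrative and are not part of the question's code.

```python
data = 'sp\xe4m'
encoded = data.encode('utf8')

# Pair each byte offset with the hex value stored at that offset.
layout = [(offset, format(byte, '02x')) for offset, byte in enumerate(encoded)]
print(layout)

# Decoding from offset 3 starts on the A4 continuation byte and fails.
try:
    encoded[3:].decode('utf8')
    error = None
except UnicodeDecodeError as exc:
    error = exc
print(error)

# Decoding from offset 4 starts on a character boundary and succeeds.
tail = encoded[4:].decode('utf8')
print(tail)
```

Offsets 2 and 3 hold the two halves of ä, so byte offset 3 is not a character boundary while offsets 2 and 4 are.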
By seeking to position 3, you moved the file pointer to the A4 byte, and calling .read() then fails because, without the preceding C3 byte, there is not enough context to decode the character. This raises the UnicodeDecodeError; the A4 byte is unexpected because it is a continuation byte, and a continuation byte on its own is not a valid UTF-8 sequence.
Seek to position 4 instead:
>>> f.seek(4); f.read(1)
'm'
Better still, don't seek around in UTF-8 data, or open the file in binary mode and decode manually.
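As a rough sketch of both suggestions, the following writes the same string to a scratch file and then reads the character at index 3 in two ways. The scratch path and the variable names are assumptions made for illustration. The second option relies on the documented rule that, for text files, the only offsets safe to pass to seek() are 0 and values previously returned by tell().

```python
import os
import tempfile

data = 'sp\xe4m'
# Hypothetical scratch file, so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), 'test')

with open(path, 'wb') as f:
    f.write(data.encode('utf8'))

# Option 1: read the raw bytes, decode once, then index by character.
with open(path, 'rb') as f:
    text = f.read().decode('utf8')
char_at_3 = text[3]

# Option 2: stay in text mode, but only seek to 0 or to a tell() value.
with open(path, encoding='utf8') as f:
    f.read(3)        # consume 's', 'p', 'ä'
    mark = f.tell()  # opaque position just before 'm'
    f.seek(0)        # rewind to the start
    f.seek(mark)     # jump back to the saved position
    after_mark = f.read(1)

print(char_at_3, after_mark)
```

Decoding first means all indexing happens on characters, so the byte width of ä never matters. The tell() value should be treated as an opaque marker rather than as a character count.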
Upvotes: 2