mrgou

Reputation: 2478

UTF-8 decoding an ANSI encoded file throws an error

Here's something I'm trying to understand. I was under the impression that UTF-8 is backwards compatible, so that I could always decode a text file as UTF-8, even if it's an ANSI-encoded file. But that doesn't seem to be the case:

In [1]: ansi_str = 'éµaØc'

In [2]: with open('test.txt', 'w', encoding='ansi') as f:
   ...:     f.write(ansi_str)
   ...:

In [3]: with open('test.txt', 'r', encoding='utf-8') as f:
   ...:     print(f.read())
   ...:
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-b0711b7b947e> in <module>
      1 with open('test.txt', 'r', encoding='utf-8') as f:
----> 2     print(f.read())
      3

c:\program files\python37\lib\codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

So it looks like if my code expects UTF-8 and is likely to encounter an ANSI-encoded file, I need to handle the UnicodeDecodeError. That's fine, but I would appreciate it if anyone could shed some light on my initial misunderstanding.
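
For reference, the handling I have in mind would look roughly like the sketch below. The helper name `read_text` is made up for illustration, and using `cp1252` as the fallback is an assumption about what "ANSI" means on a typical Western Windows machine; it may well be a different code page elsewhere.

```python
import os
import tempfile


def read_text(path, fallback='cp1252'):
    """Try UTF-8 first; on failure, retry with a single-byte code page.

    `fallback` is an assumed stand-in for whatever "ANSI" is on the
    machine that produced the file.
    """
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()
    except UnicodeDecodeError:
        with open(path, 'r', encoding=fallback) as f:
            return f.read()


# Usage: write the test string as cp1252 bytes, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.txt')
    with open(path, 'wb') as f:
        f.write('éµaØc'.encode('cp1252'))
    result = read_text(path)

print(result == 'éµaØc')
```

Note that the fallback can still fail or silently produce the wrong characters if the file was written with some other code page, so this is a best-effort guess rather than real encoding detection.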

Thanks!

Upvotes: 0

Views: 1253

Answers (1)

deceze

Reputation: 522250

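UTF-8 is backwards compatible with ASCII, not ANSI. "ANSI" doesn't even describe any one particular encoding; on Windows it usually refers to the system's legacy code page, such as Windows-1252. And those characters you're testing with are well outside the ASCII range: a single-byte code page stores each of them as one byte above 0x7F, while UTF-8 encodes them as multi-byte sequences. So unless you actually encode them with UTF-8, you can't read them as UTF-8.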
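
To make the difference concrete, here is a small sketch comparing the raw bytes. It assumes `cp1252` as a stand-in for "ANSI", since the actual code page depends on the Windows locale; the variable names are only for illustration.

```python
text = 'éµaØc'

# Pure ASCII text produces identical bytes in both encodings,
# which is the backwards compatibility UTF-8 actually offers.
print('abc'.encode('ascii') == 'abc'.encode('utf-8'))

# Assumption: cp1252 stands in for "ANSI" here.
ansi_bytes = text.encode('cp1252')
utf8_bytes = text.encode('utf-8')

print(ansi_bytes)  # one byte per character
print(utf8_bytes)  # two bytes for each non-ASCII character

# Decoding the cp1252 bytes as UTF-8 fails: 0xE9 announces a
# three-byte sequence, 0xB5 is accepted as a continuation byte,
# but the following 'a' (0x61) is not one.
try:
    ansi_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    reason = exc.reason
    print(exc.start, reason)
```

The `é` and `µ` bytes at positions 0 and 1 are exactly the ones the traceback in the question complains about.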

Upvotes: 2
