Why does Python automatically encode hex in strings as UTF-8?

Question

I have been using Python to do ASCII-to-binary translations and kept running into issues when parsing the result. Eventually I thought to look at what the Python commands were actually generating.

There seems to be a rogue 0xc2 inserted in the output, for example:

$ python -c 'print("\x80")' | xxd
00000000: c280 0a                                  ...

Indeed this happens regardless of where such bytes are used:

$ python -c 'print("Test\x80Test2\x81")' | xxd
00000000: 5465 7374 c280 5465 7374 32c2 810a       Test..Test2...

On a hunch, I poked around at UTF-8 and, sure enough, U+0080 is encoded as 0xc2 0x80. Apparently, Python takes the liberty of assuming that by \x80 I actually meant the encoding of U+0080. Is there a way to change this default behavior, or otherwise explicitly state my intention of including the single byte 0x80 and not its UTF-8 encoding?

Python 3.6.2

VPfB · Accepted Answer

Python 3 is doing the right thing: "\x80" in a str literal inserts the character U+0080, because a str is a sequence of characters (Unicode code points), not a sequence of bytes.
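To make the character-versus-byte distinction concrete, here is a small sketch. The helper name encode_demo is made up for illustration; str.encode() and the codec names are standard Python.

```python
# "\x80" in a str literal is one character (code point U+0080), not one byte.
s = "\x80"


def encode_demo(text):
    """Return the bytes produced for `text` under two different encodings."""
    return {
        "utf-8": text.encode("utf-8"),            # U+0080 takes two bytes in UTF-8
        "iso-8859-1": text.encode("iso-8859-1"),  # U+0080 takes one byte in Latin-1
    }


result = encode_demo(s)
print(len(s), hex(ord(s)))   # one character, code point 0x80
print(result["utf-8"])       # the two-byte sequence seen in the xxd output
print(result["iso-8859-1"])  # the single byte 0x80
```

The bytes only come into existence when the str is encoded, which is what print() does implicitly when it writes to stdout.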

UTF-8 is the default output encoding here. To get a single byte on output, you need an encoding in which that character is represented by one byte, such as ISO-8859-1:

$ PYTHONIOENCODING=iso-8859-1 python3 -c 'print("\x80")' | xxd
00000000: 800a

PYTHONIOENCODING

If this is set before running the interpreter, it overrides the encoding used for stdin/stdout/stderr, in the syntax encodingname:errorhandler. Both the encodingname and the :errorhandler parts are optional and have the same meaning as in str.encode().
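The effect of the encodingname:errorhandler pair can be approximated inside a program with io.TextIOWrapper, which accepts the same two settings. This is only a sketch of the idea, not how the interpreter sets up stdout internally; the function name render is hypothetical.

```python
import io


def render(text, encoding="utf-8", errors="strict"):
    """Return the bytes a text stream with this encoding/error handler would emit."""
    raw = io.BytesIO()
    # newline="\n" disables platform newline translation so the bytes are predictable.
    wrapper = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    wrapper.write(text)
    wrapper.flush()
    data = raw.getvalue()
    wrapper.detach()  # keep the wrapper from closing the underlying buffer
    return data


print(render("\x80"))                          # UTF-8: two bytes
print(render("\x80", "iso-8859-1"))            # Latin-1: one byte
print(render("\u20ac", "ascii", "replace"))    # unencodable character replaced
```

If the goal is to emit arbitrary binary data rather than text, writing a bytes object to sys.stdout.buffer (for example sys.stdout.buffer.write(b"\x80")) sidesteps text encoding altogether.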
