Reputation: 4853
I have been using Python to do ASCII-to-binary translations and kept running into issues with parsing the result. Eventually I thought to look at what the Python commands were actually generating.
There seems to be a rogue 0xc2 inserted in the output, for example:
$ python -c 'print("\x80")' | xxd
00000000: c280 0a ...
Indeed this happens regardless of where such bytes are used:
$ python -c 'print("Test\x80Test2\x81")' | xxd
00000000: 5465 7374 c280 5465 7374 32c2 810a Test..Test2...
On a hunch, I poked around at UTF-8 and sure enough, U+0080 is encoded as 0xc2 0x80. Apparently, Python takes the liberty of assuming that by \x80 I actually meant the encoding for U+0080. Is there a way to change this default behavior or otherwise explicitly dictate my intention of including the single byte 0x80 and not its UTF-8 encoding?
Python 3.6.2
Upvotes: 3
Views: 2267
Reputation: 55479
If you want to output raw bytes in Python 3 you shouldn't be using the print function, since it's for outputting text in your default encoding. Instead, you can use sys.stdout.buffer.write.
ASCII is a 7-bit encoding, so if your so-called ASCII contains bytes like b'\x80' it's not legal ASCII. Perhaps your data is actually encoded with iso-8859-1, aka latin-1, or it could be the closely related Windows variant cp1252. To do this kind of thing correctly you need to determine the actual encoding that was used to create the data.
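To see why the choice of encoding matters, here is a small sketch comparing how the same byte decodes under the encodings mentioned above. The variable names are only for illustration; the codec names are the standard ones shipped with Python.

```python
raw = b"\x80"

# Latin-1 maps every byte value straight to the code point with the same number.
as_latin1 = raw.decode("latin-1")   # U+0080, a C1 control character

# cp1252 reuses the 0x80-0x9f range for printable characters instead.
as_cp1252 = raw.decode("cp1252")    # U+20AC, the euro sign

# ASCII only covers 0x00-0x7f, so 0x80 cannot be decoded at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Some bytes, such as 0x81, are undefined in Python's cp1252 codec.
try:
    b"\x81".decode("cp1252")
    cp1252_has_0x81 = True
except UnicodeDecodeError:
    cp1252_has_0x81 = False
```

So the same byte can be a control character, a euro sign, or an error, depending on which encoding you assume.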
If you want to output "Test\x80Test2\x81" and have the hex dump look like this:
00000000 54 65 73 74 80 54 65 73 74 32 81 |Test.Test2.|
You can do
import sys
s = "Test\x80Test2\x81"
sys.stdout.buffer.write(s.encode('latin1'))
This works because Latin-1 maps the first 256 Unicode code points one-to-one onto the byte values 0x00 to 0xff. Here's a quick demo:
import binascii
a = ''.join([chr(i) for i in range(256)])
b = a.encode('latin1')
print(binascii.hexlify(b))
output
b'000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff'
However, if you're actually working with binary data then you shouldn't be storing it in text strings in the first place; you should be using bytes, or possibly bytearray. The sane way to produce the b bytes string from my previous example is to do
b = bytes(range(256))
And if you have a bytes object like b"Test\x80Test2\x81" you can dump those bytes to stdout with
sys.stdout.buffer.write(b"Test\x80Test2\x81")
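As a rough illustration of working with bytes and bytearray directly, the sketch below checks that both ways of building the 256-byte sequence agree, and then assembles the example payload without ever going through a text string. The names from_text, from_ints, buf and payload are made up for this example.

```python
# Build the same 256 byte values two ways.
from_text = "".join(chr(i) for i in range(256)).encode("latin1")
from_ints = bytes(range(256))
same = from_text == from_ints  # both hold the byte values 0..255 in order

# bytes is immutable; bytearray can be modified in place.
buf = bytearray(b"Test")
buf.append(0x80)        # append one raw byte given as an integer
buf += b"Test2\x81"     # extend with more raw bytes
payload = bytes(buf)    # freeze into an immutable bytes object
```

The resulting payload can be passed to sys.stdout.buffer.write exactly like the literal shown above.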
Upvotes: 4
Reputation: 17267
Python 3 does the right thing by inserting a character into a str, which is a string of characters, not a byte sequence.
UTF-8 is the default encoding. If you need to output a single byte, you need a different encoding in which that character is represented as one byte.
$ PYTHONIOENCODING=iso-8859-1 python3 -c 'print("\x80")' | xxd
00000000: 800a
PYTHONIOENCODING
If this is set before running the interpreter, it overrides the encoding used for stdin/stdout/stderr, in the syntax encodingname:errorhandler. Both the encodingname and the :errorhandler parts are optional and have the same meaning as in str.encode().
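To see the effect of the stream encoding without touching the environment, here is a minimal sketch that mimics it in memory. It wraps a BytesIO in a TextIOWrapper, which is the same kind of text layer that sits on top of stdout; printed_bytes is a hypothetical helper written for this example, not part of any library.

```python
import io

def printed_bytes(text, encoding):
    """Return the bytes print() emits through a text stream using `encoding`."""
    raw = io.BytesIO()
    # newline="\n" disables platform-specific newline translation.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # stop the wrapper from closing `raw` later
    return data

utf8_out = printed_bytes("\x80", "utf-8")          # two bytes plus newline
latin1_out = printed_bytes("\x80", "iso-8859-1")   # one byte plus newline
```

The character is the same in both calls; only the encoding applied on the way out differs, which is what PYTHONIOENCODING controls for the real standard streams.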
Upvotes: 5