Reputation: 2821
I am currently working on an encryption/decryption program in Python 3. It works fine with strings, but I am having problems converting it to use byte strings, because in UTF-8 a character can be encoded in anywhere from 1 to 4 bytes.
>>> '\u0123'.encode('utf-8')
b'\xc4\xa3'
>>> '\uffff'.encode('utf-8')
b'\xef\xbf\xbf'
After some research, I found that there seems to be no encoding in Python 3 that uses a fixed number of bytes for every character and can still represent every character that UTF-8 can. Is there any module or function I can use to get around this problem (for example, by padding with empty bytes so that each character encodes to a byte string of length 4)?
Upvotes: 0
Views: 2050
Reputation: 1121346
UTF-8 is a variable-width encoding; how many bytes each character takes depends on the Unicode codepoints of the input text.
If you need a fixed length encoding that can handle Unicode, use UTF-32 (UTF-16 still uses either 2 or 4 bytes per codepoint).
Note that both the UTF-16 and UTF-32 codecs prepend a Byte Order Mark (BOM): an initial U+FEFF ZERO WIDTH NO-BREAK SPACE codepoint that lets a decoder know whether the bytes were produced in little- or big-endian order. The BOM always takes 4 bytes in UTF-32, so your output is going to be 4 + (4 * character count) bytes long.
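As a quick check of that arithmetic, here is a minimal sketch. The sample string and the variable names are illustrative assumptions, not part of the question:

```python
# Sample text whose characters take 1, 2, 3 and 4 bytes in UTF-8.
text = 'h\u0123\uffff\U0001F600'

# Per-character sizes in UTF-8: variable width.
utf8_sizes = [len(ch.encode('utf-8')) for ch in text]

# Per-character sizes in UTF-16, subtracting the 2-byte BOM the codec prepends.
utf16_sizes = [len(ch.encode('utf-16')) - 2 for ch in text]

# UTF-32 with BOM: 4 bytes for the BOM plus 4 bytes per codepoint.
utf32_data = text.encode('utf-32')

print(utf8_sizes)
print(utf16_sizes)
print(len(utf32_data), 4 + 4 * len(text))
```

Only the UTF-32 sizes are the same for every character; the last sample character needs two 2-byte code units in UTF-16.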
You can encode to a specific byte order by adding -le or -be to the codec name, in which case the BOM is omitted:
>>> 'Hello world'.encode('utf-32')
b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'
>>> 'Hello world'.encode('utf-32-le')
b'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'
>>> 'Hello world'.encode('utf-32-be')
b'\x00\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d'
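Because the BOM-less codecs produce exactly 4 bytes per codepoint, the encoded data can be split into one fixed-size chunk per character and joined back again. The following is only a sketch under that assumption; `to_chunks`, `from_chunks`, `CODEC` and `WIDTH` are hypothetical names chosen for illustration:

```python
CODEC = 'utf-32-le'  # fixed width, explicit byte order, no BOM
WIDTH = 4            # bytes per codepoint in UTF-32


def to_chunks(text):
    """Encode text and split it into one 4-byte chunk per codepoint."""
    data = text.encode(CODEC)
    return [data[i:i + WIDTH] for i in range(0, len(data), WIDTH)]


def from_chunks(chunks):
    """Join 4-byte chunks and decode them back into a string."""
    return b''.join(chunks).decode(CODEC)


chunks = to_chunks('h\u0123\uffff')
print(chunks)
print(from_chunks(chunks) == 'h\u0123\uffff')
```

Keep in mind that once the bytes have been transformed by an encryption step, they may no longer be valid UTF-32, so it is safer to treat the encrypted output as raw bytes and decode only after decrypting.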
Upvotes: 2