Vladimir Shevyakov

Reputation: 2821

Fixed-length encoding in Python 3

I am currently working on an encryption/decryption program in Python 3. It works fine with strings; however, I am having trouble converting it to use byte strings, because in UTF-8 a character can be encoded in anywhere from 1 to 4 bytes.

>>> '\u0123'.encode('utf-8')
b'\xc4\xa3'
>>> '\uffff'.encode('utf-8')
b'\xef\xbf\xbf'

After some research, it appears that there is currently no encoding in Python 3 that uses a fixed number of bytes for every character and covers all the characters that UTF-8 can represent. Is there a module or function I can use to get around this problem (for example, by padding with empty bytes so that each character encodes to a byte string of length 4)?

Upvotes: 0

Views: 2050

Answers (1)

Martijn Pieters

Reputation: 1121346

UTF-8 is a variable-width encoding: each Unicode code point takes between 1 and 4 bytes, depending on its value.

If you need a fixed length encoding that can handle Unicode, use UTF-32 (UTF-16 still uses either 2 or 4 bytes per codepoint).

Note that Python's generic utf-16 and utf-32 codecs prepend a Byte Order Mark (BOM) when encoding: an initial U+FEFF ZERO WIDTH NO-BREAK SPACE code point that lets a decoder know whether the bytes were produced in little- or big-endian order. The BOM always takes 4 bytes in UTF-32, so your output is going to be 4 + (4 * number of code points) bytes long.
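
As a rough check of that length formula, here is a minimal sketch. The helper name utf32_length and the sample string are made up for illustration; only str.encode and the codecs BOM constants come from the standard library.

```python
import codecs


def utf32_length(text: str) -> int:
    """Expected size in bytes of text.encode('utf-32'), BOM included."""
    return 4 + 4 * len(text)


# Four code points that take 1, 2, 3 and 4 bytes respectively in UTF-8.
sample = "a\u0123\uffff\U0001F600"
encoded = sample.encode("utf-32")

# The generic codec uses the platform's byte order, so the BOM may be either one.
has_bom = encoded[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

print(len(sample.encode("utf-8")))          # 10
print(len(encoded), utf32_length(sample))   # 20 20
print(has_bom)                              # True
```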

You can encode to a specific byte order by adding -le or -be to the codec, in which case the BOM is omitted:

>>> 'Hello world'.encode('utf-32')
b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'
>>> 'Hello world'.encode('utf-32-le')
b'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'
>>> 'Hello world'.encode('utf-32-be')
b'\x00\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d'
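
For the encryption use case in the question, one possible approach is to encode with an explicit byte order and then slice the result into 4-byte units, one per code point. The function names to_fixed_width and from_fixed_width below are hypothetical helpers, not part of any library, and the choice of utf-32-le is an assumption; utf-32-be would work the same way.

```python
from typing import List


def to_fixed_width(text: str) -> List[bytes]:
    """Encode text as UTF-32-LE (no BOM) and split it into 4-byte units."""
    data = text.encode("utf-32-le")
    return [data[i:i + 4] for i in range(0, len(data), 4)]


def from_fixed_width(chunks: List[bytes]) -> str:
    """Join 4-byte units back together and decode them as UTF-32-LE."""
    return b"".join(chunks).decode("utf-32-le")


chunks = to_fixed_width("a\u0123\uffff")
print(chunks)  # [b'a\x00\x00\x00', b'#\x01\x00\x00', b'\xff\xff\x00\x00']
print(from_fixed_width(chunks) == "a\u0123\uffff")  # True
```

Every chunk has length 4 regardless of the character, so a cipher can work on one chunk per code point. Keep in mind that decoding only succeeds if each unit is still a valid code point, so arbitrary encrypted bytes should stay as bytes rather than be decoded back to str.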

Upvotes: 2
