splinter123

Reputation: 1203

Character encoding with Python 3

If I run

print(chr(244).encode())

I get the two-byte result b'\xc3\xb4'. Why is that? I imagine the number 244 can be encoded into one byte!

Upvotes: 0

Views: 315

Answers (3)

Martijn Pieters

Reputation: 1121904

In Python 3, str.encode() uses UTF-8 when you don't name an encoding; that default is fixed and does not depend on your locale.

Any codepoint outside the range 0-127 is encoded with multiple bytes in the variable-width UTF-8 codec.
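
To illustrate the variable width, here is a small sketch that measures how many bytes UTF-8 needs for a handful of code points. The sample values are my own picks around the width boundaries, not anything from the question:

```python
# Sample code points chosen (as an illustration) around the UTF-8 width boundaries.
samples = [0x41, 0x7F, 0x80, 0xF4, 0x7FF, 0x800, 0x20AC, 0x1F600]

# Map each code point to the length of its UTF-8 encoding.
widths = {cp: len(chr(cp).encode('utf-8')) for cp in samples}

for cp, width in widths.items():
    print(f"U+{cp:04X} -> {width} byte(s)")
```

Everything up to U+007F fits in one byte; 244 (U+00F4) falls in the two-byte range.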

You'll have to use a different codec to encode that codepoint to one byte. Latin-1 manages it just fine, and so does the EBCDIC 500 codec (code page 500), although it produces a different byte:

>>> print(chr(244).encode('utf8'))
b'\xc3\xb4'
>>> print(chr(244).encode('latin1'))
b'\xf4'
>>> print(chr(244).encode('cp500'))
b'\xcb'

But the Latin-1 and EBCDIC 500 codecs can only encode 256 codepoints each; UTF-8 can manage all of the Unicode standard.

If you were expecting the number 244 to be interpreted as a byte value instead, then you should not use chr().encode(); chr() produces a Unicode string (str), not a byte, and encoding it then produces a different result depending on the exact codec. That's because Unicode strings are text, not bytes.

Pass your number as a list of integers to the bytes() callable instead:

>>> bytes([244])
b'\xf4'

This matches the Latin-1 codec result only because the first 256 Unicode codepoints map directly to Latin-1 bytes, by design.
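
A quick way to check that mapping yourself, as a sketch using only built-ins (the variable names are mine):

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decoding as Latin-1 gives one character per byte...
text = all_bytes.decode('latin-1')

# ...and each character's code point equals the original byte value.
code_points = [ord(ch) for ch in text]
print(code_points == list(range(256)))

# Encoding back is lossless, but anything past U+00FF cannot be encoded.
round_trip = text.encode('latin-1')
try:
    chr(256).encode('latin-1')
    out_of_range_failed = False
except UnicodeEncodeError:
    out_of_range_failed = True
print(round_trip == all_bytes, out_of_range_failed)
```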

Upvotes: 2

abarnert

Reputation: 365717

I imagine the number 244 can be encoded into one byte!

Sure, if you design an encoding that can only handle 256 code points, all of them can be encoded into one byte.

But if you design an encoding that can handle all 1,114,112 code points in Unicode's code space, obviously you can't pack all of them into one byte.

If your only goal were to make things as compact as possible, you could use most of the 256 initial byte values for common code points, and only reserve a few as start bytes for less common code points.

However, if you only use the lower 128 for single-byte values, there are some big advantages, especially if you design it so that every byte is unambiguously either a 7-bit character, a start byte, or a continuation byte. That makes the algorithm a lot simpler to implement and faster. You can always scan forward or backward to the start of a character, you can search for ASCII text in a string with traditional byte-oriented (strchr) searches, a simple heuristic can detect your encoding very reliably, and you can always detect a string truncated at the start or end instead of misinterpreting it. So, that's exactly what UTF-8 does.
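
A rough sketch of what "unambiguous" buys you. The helpers classify() and char_start() are hypothetical names of mine, and they assume the input is already valid UTF-8 (they don't reject bytes that can never appear in it):

```python
def classify(byte):
    """Classify one byte from valid UTF-8 data by its leading bits."""
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx
    if byte < 0xC0:
        return "continuation"   # 10xxxxxx
    return "start"              # 11xxxxxx


def char_start(data, index):
    """Scan backward from index to the first byte of the character containing it."""
    while classify(data[index]) == "continuation":
        index -= 1
    return index


# 'a', o-with-circumflex (2 bytes), euro sign (3 bytes)
data = "a\u00f4\u20ac".encode("utf-8")
kinds = [classify(b) for b in data]
print(kinds)
print(char_start(data, 5))   # last byte of the euro sign -> its start byte
print(data.find(b"a"))       # plain byte search for ASCII text
```

Because ASCII bytes never show up inside a multi-byte sequence, the byte-oriented find() can't produce a false match in the middle of another character.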

Wikipedia explains UTF-8 pretty well. Rob Pike, one of the inventors of UTF-8, explains the design history in detail.

Upvotes: 0

Ignacio Vazquez-Abrams

Reputation: 798646

Character #244 is U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX, which is indeed encoded as 0xc3 0xb4 in UTF-8. If you want to use a single-byte encoding then you need to specify it.
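
If you want to see where those two bytes come from, here is a sketch that builds them by hand using the two-byte UTF-8 layout (110xxxxx 10xxxxxx); the variable names are mine:

```python
import unicodedata

cp = 244  # U+00F4
name = unicodedata.name(chr(cp))
print(name)

# Two-byte UTF-8 form: 110xxxxx 10xxxxxx
first = 0xC0 | (cp >> 6)      # top bits of the code point go in the start byte
second = 0x80 | (cp & 0x3F)   # low 6 bits go in the continuation byte
manual = bytes([first, second])

print(manual, manual == chr(cp).encode('utf-8'))
```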

Upvotes: 0
