How are Unicode objects represented in Python?

Question

I've been doing a lot of reading about Python, Unicode, encodings and so on, and I think I have some understanding of it now. One final thing remains, though.

Here is how I understand it:

In Python 2.x, the str object represents strings as bytes; depending on the encoding used to interpret these bytes, we can get different characters. This is a simplification, I know, but it does not matter for this question.

The unicode object, however, I've been told represents strings as Unicode code points, so basically integers. No more ambiguity in interpreting bytes as characters, as we had before.
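To make the distinction concrete, here is a minimal sketch. It is written for Python 3, where `bytes` plays the role of Python 2's `str` and `str` plays the role of `unicode`; the variable names are only illustrative.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
raw = b"\xc3\xa9"  # two bytes with no inherent textual meaning

# The same bytes give different characters under different encodings.
as_utf8 = raw.decode("utf-8")      # one character: U+00E9 (e with acute)
as_latin1 = raw.decode("latin-1")  # two characters: U+00C3 and U+00A9

# Once decoded, the text is a sequence of code points (integers).
code_points = [ord(ch) for ch in as_utf8]

print(len(as_utf8), len(as_latin1), code_points)
```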

My question is: how are these Unicode code points / integers represented under the hood in Python? Are they just 4-byte numbers regardless of the character? Does this mean they use a lot more space than their str counterparts? Not that I'm worried about the space, I just want to understand.

Ignacio Vazquez-Abrams · Accepted Answer

In CPython before 3.3, the text data in unicode objects is encoded as UCS-2 or UCS-4 (depending on a compile-time option) and stored in a buffer of 2-byte or 4-byte code units. CPython 3.3 and later use a flexible representation (PEP 393) that stores 1, 2, or 4 bytes per character, depending on the highest code point in the string. Jython and IronPython use their platforms' native string types for Unicode storage.
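The per-character width can be observed roughly from Python itself. The sketch below assumes CPython 3.3 or later; `sys.getsizeof` reports implementation-specific sizes, so other interpreters may give different numbers. The helper `bytes_per_char` is a hypothetical name introduced here for illustration.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing strings of n and 2n copies of ch."""
    short = ch * n
    long = ch * (2 * n)
    # The fixed object header cancels out in the difference.
    return (sys.getsizeof(long) - sys.getsizeof(short)) / n

samples = {
    "ASCII U+0061": "a",
    "Latin-1 U+00E9": "\u00e9",
    "BMP U+20AC": "\u20ac",
    "astral U+1F600": "\U0001F600",
}

widths = {label: bytes_per_char(ch) for label, ch in samples.items()}
for label, width in widths.items():
    print(label, width)
```

On such a build, strings whose highest code point is below U+0100 should come out at 1 byte per character, other BMP strings at 2, and strings containing astral characters at 4.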
