Reputation: 2165
I've been doing a lot of reading about Python Unicode encoding and so on, and I think I have some understanding of it now. One final thing remains, though.
Here is how I understand it:
In Python 2.x, the str object represents strings as bytes; depending on the encoding used for these bytes, we can get different characters. This is a simplification, I know, but it does not matter for this question.
The unicode object, however, I've been told represents strings as Unicode code points, so basically integers. There is no more ambiguity in interpreting bytes into their values, as there was before.
My question is: how are these Unicode code points (integers) represented under the hood in Python? Are they just 4-byte numbers regardless of value? Does this mean they use a lot more space than their str counterparts? Not that I'm worried about the space; I just want to understand.
Upvotes: 1
Views: 101
Reputation: 798646
In CPython before 3.3, the text data in unicode objects is encoded as UCS-2 or UCS-4 (depending on a compile-time option) and stored in a Py_UNICODE* buffer, so every code unit takes 2 or 4 bytes regardless of its value. CPython 3.3 and later use a variable representation for unicode data, with 1, 2, or 4 bytes per code point depending on the highest code point in the string. Jython and IronPython use their native string types for unicode storage.
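If you want to see this on your own interpreter, here is a rough sketch that estimates the per-character storage cost by comparing the sizes of two strings of different lengths. The helper name `bytes_per_code_point` is made up for this example, and the approach assumes CPython, where `sys.getsizeof` includes the character buffer; other implementations may report something different.

```python
import sys


def bytes_per_code_point(ch, n=1000):
    """Estimate storage per character (hypothetical helper, CPython only).

    Subtracting the sizes of two strings of different lengths cancels out
    the fixed object header, leaving only the cost of the character data.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


# 0xFFFF means a narrow (UCS-2) build, which only exists before 3.3;
# 0x10FFFF means a wide (UCS-4) build or any CPython 3.3+.
is_narrow_build = sys.maxunicode == 0xFFFF

ascii_width = bytes_per_code_point(u"a")            # highest code point < 256
bmp_width = bytes_per_code_point(u"\u20ac")         # euro sign, needs 16 bits
astral_width = bytes_per_code_point(u"\U0001F600")  # outside the BMP

print("narrow build:", is_narrow_build)
print("ASCII string:", ascii_width, "byte(s) per character")
print("BMP string:", bmp_width, "byte(s) per character")
print("astral string:", astral_width, "byte(s) per character")
```

On CPython 3.3 or later this should report 1, 2, and 4 bytes respectively, which is the variable representation in action. On an older wide build you would expect the same width for all three strings.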
Upvotes: 3