How does Python 2 represent Unicode internally?

Question

The official Python 2 page on Unicode says:

Under the hood, Python represents Unicode strings as either 16- or 32-bit integers, depending on how the Python interpreter was compiled.

What does the above sentence mean? Does it mean that Python 2 has its own special encoding of Unicode? If so, why not just use UTF-8?

Ulrich Eckhardt · Accepted Answer

This statement simply means that the underlying C code supports both representations, 16-bit and 32-bit code units (roughly UTF-16/UCS-2 and UTF-32/UCS-4), and that one of them is chosen when the interpreter is built. Which one typically depends on user choice, compiler and operating system.

As for the possible rationale, there are reasons not to use UTF-8:
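Which variant a given interpreter was built with can be checked at run time. The following is a minimal sketch, assuming the documented behaviour that `sys.maxunicode` is 0xFFFF on a 16-bit ("narrow") build and 0x10FFFF on a 32-bit ("wide") build; the helper name `build_width` is made up for illustration.

```python
import sys

def build_width():
    """Guess the internal code unit width from sys.maxunicode.

    Assumption: 0xFFFF means a narrow (16-bit) build, anything larger
    means a wide (32-bit) build.
    """
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"

print(build_width())
```

On Python 3.3 and later this distinction no longer exists and `sys.maxunicode` is always 0x10FFFF, so the check is only informative on Python 2 and early Python 3.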

First and foremost, indexing into a UTF-8 string is O(n), while it is O(1) for UTF-32/UCS-4. That is irrelevant for streamed data, and UTF-8 can actually save space for transmission or storage, but in-memory handling is more convenient with one fixed-size code unit per Unicode code point.
Secondly, one code unit per code point maps directly onto the API that Python itself exposes in the language (indexing, slicing, len), so this is a natural choice.
On MS Windows platforms, the native encoding for the UI and the filesystem is UTF-16, so using that encoding provides seamless integration with that platform.
On some compilers wchar_t is actually a 16-bit type, so if you wanted to use a 32-bit type there, you would have to reimplement all kinds of functions for your self-invented character type. Dropping support for anything above the Unicode BMP, or leaking surrogate pairs into the Python API, is then a reasonable compromise (but unfortunately one that sticks).
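To make the indexing and surrogate points concrete, here is a rough sketch in pure Python. It is not how CPython is implemented; the function names `utf8_index`, `utf32_index` and `utf16_units` are hypothetical and exist only for this illustration.

```python
def utf8_index(data, i):
    """Return the i-th code point of UTF-8 bytes by scanning from the start: O(n)."""
    if i < 0:
        raise IndexError(i)
    count = -1
    start = None
    for pos, byte in enumerate(bytearray(data)):
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a new code point starts here
            count += 1
            if count == i:
                start = pos
            elif count == i + 1:
                return bytes(data[start:pos]).decode("utf-8")
    if start is None:
        raise IndexError(i)
    return bytes(data[start:]).decode("utf-8")

def utf32_index(data, i):
    """Return the i-th code point of UTF-32-LE bytes by direct offset: O(1)."""
    if i < 0:
        raise IndexError(i)
    chunk = data[4 * i:4 * i + 4]  # every code point occupies exactly 4 bytes
    if len(chunk) != 4:
        raise IndexError(i)
    return bytes(chunk).decode("utf-32-le")

def utf16_units(text):
    """Count 16-bit code units needed for text; non-BMP code points need two."""
    return len(text.encode("utf-16-le")) // 2

text = u"a\u00e9\u20ac"  # 1-, 2- and 3-byte sequences in UTF-8
print(utf8_index(text.encode("utf-8"), 2) == utf32_index(text.encode("utf-32-le"), 2))
print(utf16_units(u"\U0001F600"))  # a code point above the BMP
```

The UTF-8 lookup has to walk every byte before the target, while the UTF-32 lookup is a single slice. The last function shows why a 16-bit representation either has to drop code points above the BMP or expose them as two units (a surrogate pair).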

Note that these are only possible reasons; I don't claim that they apply to Python's implementation.
