Reputation:
Based on my own readings (including this article), it seems that by default Python encodes with UTF-8. Strings are read in under the assumption that they're in UTF-8 encoding (more source).
Those strings are then translated to plain Unicode, using Latin-1, UCS-2, or UCS-4 for the entire string depending on the highest code point of UTF-8 it encounters. This seems to match what I've done on the terminal. The character Ǧ has Unicode code point of 486, and can only be fit in UCS-2.
string1 = "Ǧ"
sys.getsizeof(string1) # This prints 76
string1 = "Ǧa"
sys.getsizeof(string1) # This prints 78, as if 'a' takes two bytes
string2 = "a"
sys.getsizeof(string2) # This prints 50
string2 = "aa"
sys.getsizeof(string2) # This prints 51, as if 'a' takes one byte
I have two questions. First off, when printing to terminal, what is the process with which strings are encoded and decoded? If we call print(), are the strings first encoded to UTF-8 (from UCS-2 or Latin-1 in our examples), where the system decodes it to print to screen? Second off, what's with the large initial increment in the size? Why do strings represented with Latin-1 have an initial size of 49, while strings with UCS-2 have an initial size of 74?
Thanks!
Upvotes: 3
Views: 3262
Reputation: 155383
Most of your points are related to PEP 393: Flexible string representation. While UTF-8 is used (on Python 3) as the default source code encoding, the default encoding for file I/O is locale based, and the internal representation is ASCII, latin-1, UTF-16 or UTF-32, depending on the largest code point, possibly with a cached UTF-8 representation and/or a cached wchar_t
representation for use with specific C APIs (deprecated APIs in the case of the wchar_t
representation).
So to answer your questions:
The terminal encoding, as noted, is platform dependent; the internal representation is reencoded to whatever your platform requires and output as bytes.
The change in the base size between ASCII and UTF-16 strings is because the flexible string representation uses a larger baseline struct for non-ASCII strings (it needs additional space to store a pointer for the cached UTF-8 encoding required by some C level APIs for instance), as well as more bytes per character.
Upvotes: 1