Reputation: 97
I believe most of you who are familiar with Python have read Dive Into Python 3. In chapter 4.3, it says this:
In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question.
I think I understand what this means: strings = characters in the Unicode set, and Python can help you encode those characters with different encodings. However, aren't characters in Python stored as bytes in the computer anyway? For example, given s = 'strings', s is surely stored in my computer as a byte stream '0100100101...' or whatever. So which encoding is used here: the "default" encoding of Python?
Thanks!
Upvotes: 5
Views: 4460
Reputation: 354356
Python 3 distinguishes between text and binary data. Text is guaranteed to be in Unicode, though no specific encoding is specified, as far as I could see. So it could be UTF-8, or UTF-16, or UTF-32¹ – but you wouldn't even notice.
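To make the "you wouldn't even notice" part concrete, here is a small sketch. The sample string and variable names are my own, chosen only for illustration. It shows that length and indexing work on code points, while the byte count only appears once you pick an encoding yourself:

```python
# Hypothetical sample text: "naïve" plus a space and an emoji outside the BMP.
s = "na\u00efve \U0001F600"

# str operations count code points, not bytes or code units.
length = len(s)            # 7 code points
last = s[6]                # the emoji, as a single character
last_code_point = ord(last)

# The byte count depends entirely on the encoding you choose explicitly.
sizes = {codec: len(s.encode(codec))
         for codec in ("utf-8", "utf-16-le", "utf-32-le")}

print(length, hex(last_code_point))
print(sizes)
```

Whatever the interpreter uses internally, the str-level results above stay the same; only the explicitly encoded byte strings differ in size.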
The main point here is: you shouldn't even care. If you want to deal with text, use text strings and access them by code point (the number of a single Unicode character, independent of the internal UTF, which may split a code point into several smaller code units). If you want bytes, use b"" and access them by byte. And if you want a string as a byte sequence in a specific encoding, use .encode().
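A minimal sketch of the three cases just described (text, bytes, and converting between them). The word "café" and the variable names are just assumptions for the example:

```python
text = "caf\u00e9"        # str: a sequence of Unicode code points
data = b"caf\xc3\xa9"     # bytes: a sequence of integers in range(256)

char = text[3]            # indexing a str gives a one-character str
byte = data[3]            # indexing bytes gives an int

# .encode() turns text into bytes in a specific encoding;
# .decode() goes the other way.
utf8 = text.encode("utf-8")
cp1252 = text.encode("cp1252")
round_trip = utf8.decode("utf-8")

print(repr(char), byte)
print(utf8, cp1252, round_trip == text)
```

Note that the same text produces different byte sequences under UTF-8 and CP-1252, which is why "is this string UTF-8?" only makes sense for bytes, not for str.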
¹ Or even UTF-9, if someone is insane enough to implement Python on a PDP-10.
Upvotes: 8