Unicode byte vs code point (Python)

Question

In http://nedbatchelder.com/text/unipain.html it is explained that:

In Python 2, there are two different string data types. A plain-old string literal gives you a "str" object, which stores bytes. If you use a "u" prefix, you get a "unicode" object, which stores code points.

What's the difference between a code point and a byte? (I'm thinking not really in terms of Python per se, but of the concept in general.) Essentially it's all just a bunch of bits, right? As I understand it, a plain old string literal treats each 8 bits as a byte and handles it as such; we interpret each byte as an integer, which lets us map it to ASCII and the extended character sets. What's the difference between interpreting an integer as one of those characters and interpreting a "code point" as a Unicode character? The article says Python's unicode object stores "code points". Isn't that just the same as plain old bytes, except possibly for the interpretation (where the bits of each Unicode character start and stop in UTF-8, for example)?

Ignacio Vazquez-Abrams · Accepted Answer

A code point is a number that acts as an identifier for a Unicode character. A code point itself cannot be stored directly; it must first be encoded from Unicode into bytes using some encoding, e.g. UTF-16LE. While a certain byte or sequence of bytes can represent a specific code point in a given encoding, without the encoding information there is nothing to connect the code point to the bytes.
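To make the distinction concrete, here is a minimal sketch. It assumes Python 3, where `str` plays the role of Python 2's `unicode` and `bytes` plays the role of Python 2's `str`; the sample text and variable names are purely illustrative. It shows the same two code points turning into different byte sequences under different encodings, and the same bytes turning into different characters when decoded with the wrong encoding.

```python
# Assumes Python 3: str holds code points, bytes holds encoded bytes.
# Illustrative sample: U+00E9 (e with acute accent) and U+20AC (euro sign).
text = "\u00e9\u20ac"

# A code point is just an integer that identifies a character.
code_points = [ord(ch) for ch in text]
print(code_points)          # [233, 8364]

# The same code points become different bytes depending on the encoding.
utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")
print(utf8.hex())           # c3a9e282ac (5 bytes)
print(utf16le.hex())        # e900ac20   (4 bytes)

# Bytes only map back to the right code points if the encoding is known.
assert utf8.decode("utf-8") == text
assert utf16le.decode("utf-16-le") == text

# Decoding with the wrong encoding succeeds here but gives other characters:
# Latin-1 maps every single byte to one code point, so 5 bytes give 5 characters.
wrong = utf8.decode("latin-1")
print(len(text), len(wrong))  # 2 5
```

The two characters stay the same throughout; only their byte representation changes, which is why the bytes alone, without the encoding, do not tell you which code points were meant.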
