john-lenon-dos-reis

Reputation: 33

Why can characters be converted to bytes?

I have a question related to character encodings in computing (ASCII and UTF-8) and would be very grateful if anyone can help me.

We know that for a computer absolutely everything is a sequence of bytes; that is, the text and characters that we humans know are just graphical representations of byte sequences interpreted by the computer.

I've read in several articles that encoding is the process of mapping characters to binary values for storage in memory. But that doesn't make sense to me: for the computer all data is just bytes, so it would be the same as mapping bytes to bytes.

Does what I am saying make sense?

Upvotes: 1

Views: 851

Answers (2)

Mark Tolonen

Reputation: 177461

Internally, a computer has to store characters as bytes in some form, but ideally that storage is opaque. A "string" can store "characters", but exactly how those characters are encoded in memory is up to the program.

Encoding is the process of taking a "character" and converting it to a specific byte representation.

Decoding is the process of taking a specific byte representation and converting it back to the program's notion of a "character".

As a specific example, the Python language has a "text" type made up of Unicode code points, and a "bytes" type made up of byte values 0-255. You don't really need to know how a text string is stored in memory, and in fact the storage has varied with compile options and Python version over the years (UTF-16, UTF-32, and currently a variable-width representation that depends on the highest code point present in the string). A text string can be encoded to a byte string and decoded back to a text string:

>>> s = '你好'  # Two Chinese characters. How are they stored in memory? Does it matter?
>>> type(s)
<class 'str'>
>>> len(s)
2
>>> b = s.encode('utf8')
>>> type(b)
<class 'bytes'>
>>> len(b)
6
>>> print(b)
b'\xe4\xbd\xa0\xe5\xa5\xbd'  # 6 bytes encoding the 2 characters in UTF-8
>>> b.decode('utf8')         # decode from UTF-8 back to text
'你好'
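To connect this back to the "bytes to bytes" concern in the question, here is a small sketch showing that the byte representation is not fixed by the character itself but by the encoding chosen. The character and the three encodings used are just illustrative picks, not anything special:

```python
# One character (code point U+00E9), three different byte representations.
text = "é"

utf8_bytes = text.encode("utf-8")        # two bytes in UTF-8
latin1_bytes = text.encode("latin-1")    # one byte in Latin-1
utf16_bytes = text.encode("utf-16-le")   # two bytes in UTF-16 (little-endian)

print(utf8_bytes, latin1_bytes, utf16_bytes)

# The bytes alone do not say which characters they stand for:
# decoding the UTF-8 bytes with the wrong encoding gives different text.
misread = utf8_bytes.decode("latin-1")
print(misread)

# Decoding with the matching encoding round-trips to the original text.
assert utf8_bytes.decode("utf-8") == text
```

So an encoding is the agreed rule that says which byte sequence stands for which abstract character; without that rule the bytes are only numbers.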

Upvotes: 1

Kevin

Reputation: 1

In ASCII, each character is assigned a bit pattern consisting of 7 bits. Since each bit can take two values, there are 2^7 = 128 different bit patterns, which can also be interpreted as the integers 0-127 (hexadecimal 00h-7Fh).
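As a rough illustration of the 7-bit patterns described above, the following sketch prints the integer, bit pattern, and hex value for a few ASCII characters. The helper name `ascii_bits` is made up for this example:

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit pattern of a single ASCII character (hypothetical helper)."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")

for ch in "Az~":
    print(ch, ord(ch), ascii_bits(ch), hex(ord(ch)))

# Every byte of ASCII-encoded text is below 128.
assert max("Hello".encode("ascii")) < 128

# Characters outside the 0-127 range cannot be encoded as ASCII.
try:
    "é".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```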

In UTF-8, each Unicode character (code point) is encoded as a variable-length sequence of one to four bytes. As with all UTF formats, every Unicode character can be represented this way.
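A short sketch of the variable length, using four sample characters chosen only for illustration (one for each possible length):

```python
# Sample characters picked to need 1, 2, 3 and 4 bytes in UTF-8.
samples = ["A", "é", "你", "😀"]

lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")

# ASCII characters keep their single-byte ASCII value in UTF-8.
assert "A".encode("utf-8") == "A".encode("ascii")
```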

Upvotes: 0
