DrStrangeLove

Reputation: 11567

How does low-level character encodings work?

Let's say I have a text file called sometext.txt. It has a line - "Sic semper tyrannis" - which is (correct me if I'm wrong..)

83 105 99 32 115 101 109 112 101 114 32 116 121
114 97 110 110 105 115

(in decimal ASCII)
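Those decimal values can be checked with a short Python 3 sketch (assuming the file really is plain ASCII):

```python
# Rebuild the string from the decimal byte values listed above.
codes = [83, 105, 99, 32, 115, 101, 109, 112, 101, 114,
         32, 116, 121, 114, 97, 110, 110, 105, 115]
print(bytes(codes).decode('ascii'))  # Sic semper tyrannis
```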

When I read this line from the file using standard library file I/O routines, I don't perform any character encoding work.. (or do I??)

The question is: which software component actually converts the 0s and 1s into characters (i.e. contains the algorithm for converting 0s and 1s into characters)? Is it an OS component? Which one?

Upvotes: 4

Views: 666

Answers (5)

GolezTrol

Reputation: 116180

It has nothing (well, not much) to do with 0s and 1s. Most character encodings work with entire bytes of 8 bits. Each of the numbers you wrote represents a single byte. In ASCII, every character is a single byte. Besides that, ASCII is a subset of ANSI and of UTF-8, making it compatible with the most widely used character sets. ASCII covers only the lower half of the byte range: characters up to 127.

For ANSI you need some encoding. ANSI specifies the characters in the upper half of the byte range. In UTF-8, those ANSI characters don't exist. Instead, those last 128 byte values represent part of a character; a whole character is made of 2 to 4 bytes. The exception is the 128 ASCII characters: they are still the same old single-byte characters. I think this was mainly done because if UTF-8 weren't compatible with ASCII, there is no way Americans would have adopted it. ;-)
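A sketch of that difference (cp1252 is assumed here as a typical Western "ANSI" code page):

```python
# The same characters encoded as a Windows "ANSI" code page vs. UTF-8.
print('é'.encode('cp1252'))  # b'\xe9'     - one byte, upper half of the range
print('é'.encode('utf-8'))   # b'\xc3\xa9' - two bytes in UTF-8
print('A'.encode('cp1252'))  # b'A'
print('A'.encode('utf-8'))   # b'A'        - the ASCII range is identical in both
```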

But yes, the OS does have various functions to work with character encodings. Where they live depends on the OS and platform, but if I read your question right, you're not really looking for a specific API. Your question cannot be answered that concretely. There are numerous ways to work with characters, and there is a major difference between working with the actual character data and drawing it on the screen (the difference between a character and a font).

Upvotes: 1

dan04

Reputation: 91207

First, I recommend that you read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).


When I read this line from the file using standard library file I/O routines, I don't perform any character encoding work.. (or do I??)

That depends heavily on which standard library you mean.

In C, when you write:

FILE* f = fopen("filename.txt", "w");  /* assumes <stdio.h> is included */
fputs("Sic semper tyrannis", f);
fclose(f);                             /* flush and close the stream */

No encoding conversion is performed; the chars in the string are just written to the file as-is (except for line breaks). (Encoding is relevant when you're editing the source file.)

But in Python 3.x, when you write:

f = open('filename.txt', 'w', encoding='UTF-8')
f.write('Sic semper tyrannis')

The write function performs an internal conversion from the UTF-16/UTF-32 representation of Python's str type to the UTF-8 encoding used on disk.
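To make that conversion visible, here is a sketch (the temporary file path is just an example) that writes non-ASCII text and then reads the raw bytes back:

```python
import os
import tempfile

# Write text as UTF-8, then inspect the raw bytes that landed on disk.
path = os.path.join(tempfile.mkdtemp(), 'filename.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('café')                 # 4 characters
with open(path, 'rb') as f:
    raw = f.read()
print(raw)                          # b'caf\xc3\xa9' - 5 bytes; 'é' became two
```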


The question is: which software component actually converts the 0s and 1s into characters (i.e. contains the algorithm for converting 0s and 1s into characters)? Is it an OS component? Which one?

The decoding function (like MultiByteToWideChar or bytes.decode) for the appropriate character encoding converts the bytes into Unicode code points, which are integers that uniquely identify characters. A font converts code points to glyphs, the images of the characters that appear on screen or paper.
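A sketch of that decode step in Python, using the first few bytes of your file:

```python
# bytes -> Unicode code points via a decoding function.
raw = bytes([83, 105, 99])           # the first three bytes: 'S', 'i', 'c'
text = raw.decode('ascii')           # the decode step: bytes -> characters
print(text)                          # Sic
print([ord(c) for c in text])        # [83, 105, 99] - the code points
```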

Upvotes: 4

user166390

Reputation:

It's all a bunch of 1's and 0's.

An ASCII "A" is just the letter displayed when the value (01000001b, or 0x41, or 65 decimal) is "encountered" (depending on context, naturally). There is no "conversion"; it's just a different view of the same thing, defined by an accepted mapping.
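In Python terms, that accepted mapping is just different views of one value (a quick sketch):

```python
# Different views of the same value under the ASCII mapping.
print(ord('A'))                 # 65
print(hex(ord('A')))            # 0x41
print(format(ord('A'), '08b'))  # 01000001
print(chr(65))                  # A
```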


Unicode (and other multi-byte) character sets often use different encodings; in UTF-8 (a Unicode encoding), for instance, a single Unicode character can be mapped as 1, 2, 3 or 4 bytes depending upon the character. Unicode encoding conversion often takes place in the IO libraries that come as part of a language or runtime; however, a Unicode-aware operating system also needs to understand a Unicode encoding itself (in system calls) so the line can be blurred.

UTF-8 has the nice property that all plain ASCII characters map to a single byte, which makes it the Unicode encoding most compatible with traditional ASCII.
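A sketch of those variable widths (the sample characters are arbitrary picks from each range):

```python
# UTF-8 width grows with the code point: 1, 2, 3 and 4 bytes.
for ch in ['A', 'é', '€', '𝄞']:      # U+0041, U+00E9, U+20AC, U+1D11E
    print(ch, len(ch.encode('utf-8')))
```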

Upvotes: 4

Martin James

Reputation: 24907

Like DrStrangeLove says, it's 1s & 0s all the way to your display screen and beyond - the 'A' character is an array of pixels whose color/brightness is defined by bits in the display driver. Turning that pixel array into an understandable character needs a bioElectroChemical video camera connected to 10^11 threshold logic gates running an adaptive, massively-parallel OS and apps that no one understands, especially after a few beers.

Not exactly sure what you're asking. The 0s and 1s from the file are blocked up into bytes that can represent ASCII codes by the disk driver - it will only read/write blocks of eight bits. The ASCII code bytes are rendered into displayable bitmaps by the display driver, using the chosen font.

Rgds, Martin

Upvotes: 1

Thanatos

Reputation: 44344

Which software component actually converts the 0s and 1s into characters (i.e. contains the algorithm for converting 0s and 1s into characters)?

This depends on what language you're using. For example, Python has character encoding functions:

>>> f = open( ...., 'rb')
>>> data = f.read()
>>> print(data.decode('utf-8'))
café

Here, Python has converted a sequence of bytes into a Unicode string. The component doing this is typically a library or program in userspace, although some compilers also need knowledge of character encodings.

Underneath, it's all a sequence of bytes, which are 1s and 0s. However, given a sequence of bytes, which characters do they represent? ASCII is one such "character encoding", and it tells us how to encode or decode A-Z, a-z, and a few more. There are many others, notably UTF-8 (an encoding of Unicode). In the end, if you're dealing with text, you need to know which character encoding it was encoded with.
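A sketch of why the encoding must be known: the same bytes decode to different text under different guesses.

```python
# One byte sequence, two decodings: only one guess is right.
raw = 'café'.encode('utf-8')         # b'caf\xc3\xa9'
print(raw.decode('utf-8'))           # café  - correct
print(raw.decode('latin-1'))         # cafÃ© - mojibake from the wrong guess
```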

Upvotes: 1
