Engineer999

Reputation: 3955

Would a C compiler actually have an ASCII look-up table?

I know there are a few similar questions around relating to this, but it's still not completely clear.

For example: if my C source file contains lots of string literals, does the compiler, as it translates the source file, go through each character of those strings and use a look-up table to get the ASCII number for each character?

I'd guess that when entering characters dynamically into a running C program from standard input, it is the terminal that translates the actual characters to numbers, but then if we have in the code, for example:

if (ch == 'c') { /* ... do something */ }

the compiler must have its own way of understanding and mapping the characters to numbers?
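
For instance, a tiny program like the one below prints 99 on a typical ASCII machine, and I'm not sure whether that mapping happens in the compiler, the terminal, or somewhere else:

#include <stdio.h>

int main(void)
{
    char ch = 'c';
    printf("%d\n", ch);   /* prints 99 on an ASCII system */
    return 0;
}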

Thanks in advance for some help with my confusion.

Upvotes: 4

Views: 921

Answers (3)

Adrian McCarthy

Reputation: 48038

The C standard talks about the source character set, which is the set of characters it expects to find in the source files, and the execution character set, which is the set of characters used natively by the target platform.

For most modern computers that you're likely to encounter, the source and execution character sets will be the same.

A line like if (ch == 'c') will be stored in the source file as a sequence of values from the source character set. For the 'c' part, the representation is likely 0x27 0x63 0x27, where the 0x27s represent the single quote marks and the 0x63 represents the letter c.
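
If you want to see this for yourself, here's a minimal sketch (assuming an ASCII/UTF-8 source character set) that dumps those three bytes:

#include <stdio.h>

int main(void)
{
    /* The three source characters of the token 'c': quote, c, quote. */
    const unsigned char token[] = { '\'', 'c', '\'' };
    for (int i = 0; i < 3; i++)
        printf("0x%02X ", token[i]);   /* prints 0x27 0x63 0x27 on ASCII systems */
    printf("\n");
    return 0;
}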

If the execution character set of the platform is the same as the source character set, then there's no need to translate the 0x63 to some other value. It can just use it directly.

If, however, the execution character set of the target is different (e.g., maybe you're cross-compiling for an IBM mainframe that still uses EBCDIC), then, yes, it will need a way to look up the 0x63 it finds in the source file to map it to the actual value for a c used in the target character set.
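
Purely as an illustration (no real compiler is required to do it this way), the kind of translation table a cross-compiler might consult could look like the fragment below; the EBCDIC values assume code page 1047, where lowercase 'a' through 'i' occupy 0x81 through 0x89:

/* Hypothetical sketch: a few ASCII source values mapped to EBCDIC
 * (code page 1047) execution-set values.  A real table has 256 entries. */
static const unsigned char ascii_to_ebcdic_sample[][2] = {
    { 0x61, 0x81 },  /* 'a' */
    { 0x62, 0x82 },  /* 'b' */
    { 0x63, 0x83 },  /* 'c': this is the value that ends up in the binary */
    { 0x30, 0xF0 },  /* '0' */
};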


Outside the scope of what's defined by the standard, there's the distinction between character set and encoding. While a character set tells you what characters can be represented (and what their values are), the encoding tells you how those values are stored in a file.

For "plain ASCII" text, the encoding is typically the identity function: A c has the value 0x63, and it's encoded in the file simply as a byte with the value of 0x63.

Once you get beyond ASCII, though, there can be more complex encodings. For example, if your character set is Unicode, the encoding might be UTF-8, UTF-16, or UTF-32, which represent different ways to store a sequence of Unicode values (code points) in a file.

So if your source file uses a non-trivial encoding, the compiler will have to have an algorithm and/or a lookup table to convert the values it reads from the source file into the source character set before it actually does any parsing.
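
As a rough sketch of what that conversion step involves, here is a simplified UTF-8 decoder that handles only the one- and two-byte forms (a real compiler covers the full range and rejects malformed sequences):

#include <stdio.h>

/* Decode one code point from a UTF-8 byte stream (1- and 2-byte forms only).
 * Advances *p past the bytes it consumes.  Simplified: no error checking. */
static unsigned decode_utf8(const unsigned char **p)
{
    unsigned c = *(*p)++;
    if (c < 0x80)                        /* 0xxxxxxx: plain ASCII */
        return c;
    if ((c & 0xE0) == 0xC0)              /* 110xxxxx 10xxxxxx */
        return ((c & 0x1F) << 6) | (*(*p)++ & 0x3F);
    return 0xFFFD;                       /* 3- and 4-byte forms omitted here */
}

int main(void)
{
    const unsigned char src[] = { 0x63, 0xC3, 0xA9 };  /* "cé" in UTF-8 */
    const unsigned char *p = src;
    unsigned first = decode_utf8(&p);
    unsigned second = decode_utf8(&p);
    printf("U+%04X U+%04X\n", first, second);          /* U+0063 U+00E9 */
    return 0;
}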

On most modern systems, the source character set is typically Unicode (or a subset of Unicode). On Unix-derived systems, the source file encoding is typically UTF-8. On Windows, the source encoding might be based on a code page, UTF-8, or UTF-16, depending on the code editor used to create the source file.

On many modern systems, the execution character set is also Unicode, but, on an older or less powerful computer (e.g., an embedded system), it might be restricted to ASCII or the characters within a particular code page.


Edited to address follow-on question in the comments

Any tool that reads text files (e.g., an editor or a compiler) has three options: (1) assume the encoding, (2) take an educated guess, or (3) require the user to specify it.

Most Unix utilities assume UTF-8 because UTF-8 is ubiquitous in that world.

Windows tools usually check for a Unicode byte-order mark (BOM), which can indicate UTF-16 or UTF-8. If there's no BOM, it might apply some heuristics (IsTextUnicode) to guess the encoding, or it might just assume the file is in the user's current code page.
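
A minimal sketch of that BOM check (the byte values are the standard BOM signatures; what a tool does when none is found is up to the tool):

#include <string.h>

/* Guess a text file's encoding from its first bytes.  "unknown" means
 * "apply further heuristics or fall back to a default such as the current
 * code page or UTF-8". */
static const char *guess_encoding(const unsigned char *buf, size_t len)
{
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) return "UTF-8 with BOM";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)    return "UTF-16 little-endian";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)    return "UTF-16 big-endian";
    return "unknown";
}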

For files that have only characters from ASCII, guessing wrong usually isn't fatal. UTF-8 was designed to be compatible with plain ASCII files. (In fact, every ASCII file is a valid UTF-8 file.) Also many common code pages are supersets of ASCII, so a plain ASCII file will be interpreted correctly. It would be bad to guess UTF-16 or UTF-32 for plain ASCII, but that's unlikely given how the heuristics work.
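
That compatibility is easy to check mechanically; the test is just "no byte has the high bit set":

#include <stdbool.h>
#include <stddef.h>

/* True if every byte is 7-bit, i.e. the buffer is plain ASCII and therefore
 * also valid UTF-8 and valid in any ASCII-superset code page. */
static bool is_plain_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] > 0x7F)
            return false;
    return true;
}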

Regular compilers don't expend much code dealing with all of this. The host environment can handle many of the details. A cross-compiler (one that runs on one platform to make a binary that runs on a different platform) might have to deal with mapping between character sets and encodings.

Upvotes: 4

AntoineL

Reputation: 946

No, the compiler is not required to have such a thing. Think for a minute about a pre-C11 compiler, reading EBCDIC source and translating for an EBCDIC machine. What use would an ASCII look-up table have in such a compiler?

Also think for another minute about what such ASCII look-up table(s) would look like in such a compiler!

Upvotes: 0

Bathsheba

Reputation: 234875

Sort of. Except you can drop the ASCII bit, in full generality at least.

The mapping used between int literals like 'c' and their numeric equivalents is a function of the encoding used by the architecture that the compiler is targeting. ASCII is one such encoding, but there are others, and the C standard places only minimal requirements on the encoding. An important one is that '0' through '9' must be consecutive, in one block, positive, and able to fit into a char. Another requirement is that 'A' to 'Z' and 'a' to 'z' must be positive values that can fit into a char.
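
For example, that digit guarantee is what makes the usual ch - '0' idiom portable, whereas the same trick across the alphabet is not guaranteed (EBCDIC, for instance, has gaps between 'i'/'j' and 'r'/'s'):

#include <assert.h>

int digit_value(char ch)
{
    /* Portable: the standard guarantees '0'..'9' are consecutive. */
    assert(ch >= '0' && ch <= '9');
    return ch - '0';
}

/* The equivalent for letters, e.g. ch - 'a', is NOT guaranteed to work,
 * because 'a'..'z' need not be one contiguous block. */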

Upvotes: 1
