user1814023

C Character Set - Need Clarification

I was reading the Tokenization section of the GCC C preprocessor manual, which says:

Preprocessing tokens fall into five broad classes:

  1. identifiers
  2. preprocessing numbers
  3. string literals
  4. punctuators
  5. other.

Any other single character is considered “other”. It is passed on to the preprocessor's output unmolested. The C compiler will almost certainly reject source code containing “other” tokens. In ASCII, the only other characters are ‘@’, ‘$’, ‘`’, and control characters other than NUL (all bits zero).

I was also browsing the web and came across an article on the 'C Character Set' that lists '@' as one of its characters. Is the article wrong to include '@' in the C character set, or is my understanding wrong?

Thanks.

Upvotes: 1

Views: 342

Answers (3)

ams

Reputation: 25599

I presume you mean the character set you get when you set LANG=C?

That's not the same thing at all. That's a locale that basically just says "use ASCII" with no special extras. It requires no extra translation files or terminal support. It just means you get the default output from everything.


Alternatively, maybe you really did mean the set of characters that may appear in a C program?

Don't forget that C programs may use those characters inside quotes. Just because they don't have a meaning in any language keyword or variable doesn't mean they can't exist in the file. On the other hand, it may be an error to include UTF-8 characters inside a C string, for example.

Just because a character is valid inside a C program, doesn't mean it's valid everywhere. The if keyword is not valid outside a function, for instance.
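To illustrate the point about quotes, here is a minimal sketch. It assumes an ASCII-compatible source and execution character set, and the helper name count_other_chars is made up for this example. The characters '@', '$' and '`' appear only inside string literals, so the compiler treats them as plain data:

```c
#include <stddef.h>

/* '@', '$' and '`' play no role in C's grammar, but inside a string
   literal they are ordinary data (assuming an ASCII-compatible charset). */
static const char others[] = "@$`";

/* Hypothetical helper: count the characters of s that appear in others. */
size_t count_other_chars(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        for (const char *o = others; *o != '\0'; ++o) {
            if (*s == *o) {
                ++n;
                break;  /* stop scanning others once this character matched */
            }
        }
    }
    return n;
}
```

Calling count_other_chars("user@example.com costs $5") compiles without complaint even though '@' and '$' could not be written outside the quotes.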

Upvotes: 0

James Kanze

Reputation: 154027

I'm not sure that your question is completely clear. Both the C and the C++ standards require the compiler to support all of the characters in Unicode, although not necessarily in a transparent fashion: how the compiler maps input into its internal character set is implementation defined. But by this definition, all compilers are required to accept @, $, etc.

What you can do with any specific character is a different question, and there are a lot of characters (like @ and $) which can only appear in a comment, a string literal or a character literal (character constants fall under the "string literals" class in the text you quote). Symbols, for example, may only contain _ and characters for which the Unicode type is a letter or a digit (roughly speaking; the standard specifies exactly which characters are and are not allowed).
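As a rough sketch of the identifier rule, restricted to the basic characters only: the check below ignores universal character names, extended characters and keywords, and the name is_plain_identifier is hypothetical.

```c
#include <stdbool.h>
#include <string.h>

static const char ident_start[] =
    "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
static const char digits[] = "0123456789";

/* Hypothetical helper: true if s is spelled like an identifier made only of
   '_', basic letters and digits. Keywords are not rejected, and universal
   character names are not handled. */
bool is_plain_identifier(const char *s)
{
    /* The first character must be '_' or a letter, never a digit. */
    if (*s == '\0' || strchr(ident_start, *s) == NULL)
        return false;
    /* Every later character must be '_', a letter or a digit. */
    for (++s; *s != '\0'; ++s) {
        if (strchr(ident_start, *s) == NULL && strchr(digits, *s) == NULL)
            return false;
    }
    return true;
}
```

With this check, "total_2" passes, while "user@host", "$price" and "2fast" are rejected.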

Since how the implementation maps the characters in the input to the source character set is implementation defined, a compiler can map 0x40 (which would be a @ in ASCII, Latin-1 or Unicode) to some other character, which is allowed in a symbol. I don't know of any which take this route; I suspect, in fact, that a compiler which wanted to allow @ or $ in a symbol would simply choose to be non-conformant, rather than make it impossible to have the character in a string literal.

Upvotes: 1

Mats Petersson

Reputation: 129524

There are some compilers that allow "extra" characters, such as @ or $, as part of identifiers. These are extensions, not part of the standard. From memory, the C++ standard mentions this in a way that indicates that "a compiler may add extra characters".

Section 2.3:

The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:(14)

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

(Note 14: The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.)
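The list above can be turned into a small membership test. This is only a sketch: it assumes an ASCII-compatible encoding, and the names basic_graphical and in_basic_source_set are invented for the example.

```c
#include <stdbool.h>
#include <string.h>

/* The 91 graphical characters from the quoted list, in the same order. */
static const char basic_graphical[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";

/* Hypothetical helper: true if c is one of the 96 basic source characters. */
bool in_basic_source_set(char c)
{
    if (c == '\0')
        return false;  /* strchr would otherwise match the terminator */
    /* Space plus the four control characters named in the quoted text. */
    if (c == ' ' || c == '\t' || c == '\v' || c == '\f' || c == '\n')
        return true;
    return strchr(basic_graphical, c) != NULL;
}
```

Under this check '@', '$' and '`' are all outside the basic source character set, which matches the "other" characters named in the GCC text quoted in the question.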

Upvotes: 2
