Eric Postpischil

Reputation: 222938

What does the C standard specify for the value of a character constant with a hexadecimal escape sequence?

What does the C 2018 standard specify for the value of a hexadecimal escape sequence such as '\xFF'?

Consider a C implementation in which char is signed and eight bits.

Clause 6.4.4.4 tells us about character constants. In paragraph 6, it discusses hexadecimal escape sequences:

The hexadecimal digits that follow the backslash and the letter x in a hexadecimal escape sequence are taken to be part of the construction of a single character for an integer character constant or of a single wide character for a wide character constant. The numerical value of the hexadecimal integer so formed specifies the value of the desired character or wide character.

The hexadecimal integer is “FF”. By the usual rules of hexadecimal notation, its value[1] is 255. Note that, so far, we do not have a specific type: A “character” is a “member of a set of elements used for the organization, control, or representation of data” (3.7) or a “bit representation that fits in a byte” (3.7.1). When \xFF is used in '\xFF', it is a c-char in the grammar (6.4.4.4 1), and '\xFF' is an integer character constant. Per 6.4.4.4 2, “An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in ’x’.”

6.4.4.4 9 specifies constraints on character constants:

The value of an octal or hexadecimal escape sequence shall be in the range of representable values for the corresponding type:

That is followed by a table that, for character constants with no prefix, shows the corresponding type is unsigned char.

So far, so good. Our hexadecimal escape sequence has value 255, which is in the range of an unsigned char.

Then 6.4.4.4 10 purports to tell us the value of the character constant. I quote it here with its sentences separated and labeled for reference:

(i) An integer character constant has type int.

(ii) The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer.

(iii) The value of an integer character constant containing more than one character (e.g., ’ab’ ), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.

(iv) If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.

If 255 maps to an execution character, (ii) applies, and the value of '\xFF' is the value of that character. This is the first use of “maps” in the standard; it is not defined elsewhere. Should it mean anything other than a map from the value derived so far (255) to an execution character with the same value? If so, for (ii) to apply, there must be an execution character with the value 255. Then the value of '\xFF' would be 255.

Otherwise (iii) applies, and the value of '\xFF' is implementation-defined.

Regardless of whether (ii) or (iii) applies, (iv) also applies. It says the value of '\xFF' is the value of a char object whose value is 255, subsequently converted to int. But, since char is signed and eight-bit, there is no char object whose value is 255. So the fourth sentence states an impossibility.
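To see what a particular implementation does here, something like the following sketch may help. The names ESCAPE_VALUE, char_can_hold_escape_value, constant_value and converted_value are made up for illustration. The conversion of 255 to char is only one possible reading of sentence (iv), and where char cannot hold 255 its result is implementation-defined, so this shows what a compiler does, not what the standard requires.

```c
#include <limits.h>

/* The numerical value of the hexadecimal integer in \xFF (6.4.4.4 6). */
enum { ESCAPE_VALUE = 0xFF };

/* Returns 1 if some char object can hold ESCAPE_VALUE exactly, 0 otherwise.
   With a signed eight-bit char, CHAR_MAX is 127 and this returns 0. */
int char_can_hold_escape_value(void)
{
    return ESCAPE_VALUE <= CHAR_MAX;
}

/* The value this implementation gives the character constant itself. */
int constant_value(void)
{
    return '\xFF';
}

/* One possible reading of sentence (iv): convert 255 to char, then to int.
   When char cannot hold 255, the first conversion is implementation-defined
   (6.3.1.3 3), so this is an observation, not a portable guarantee. */
int converted_value(void)
{
    char c = (char)ESCAPE_VALUE;
    return (int)c;
}
```

On implementations where char is signed and eight bits, both functions commonly return -1, which suggests compilers treat (iv) as a wrapping conversion rather than as "a char object whose value is 255".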

Footnote

[1] 3.19 defines “value” as “precise meaning of the contents of an object when interpreted as having a specific type,” but I do not believe that technical term is being used here. “The numerical value of the hexadecimal integer” has no object to discuss yet. This appears to be a use of the word “value” in an ordinary sense.

Upvotes: 5

Views: 483

Answers (1)

chqrlie

Reputation: 144810

Your demonstration leads to an interesting conclusion:

There is no portable way to write character constants with values outside the range 0 .. CHAR_MAX. This is not necessarily a problem for single characters as one can use integers in place of character constants, but there is no such alternative for string constants.

It seems type char should always be unsigned by default for consistency with many standard C library functions:

  • fgetc() returns an int: the negative value EOF on failure, or the value of an unsigned char if a byte was successfully read. Hence the meaning and effect of fgetc() == '\xFF' is implementation-defined.

  • the functions from <ctype.h> accept an int argument with the same values as those returned by fgetc(). Passing a negative char value has undefined behavior.

  • strcmp() compares strings based on the values of characters converted to unsigned char.

  • '\xFF' may have the value -1 which is completely unintuitive and is potentially identical to the value of EOF.
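The fgetc() and EOF points can be sketched in a few lines. This is only an illustration: as_fgetc_result, naive_is_ff, naive_confuses_eof and portable_is_ff are hypothetical helpers, and as_fgetc_result merely imitates the value fgetc() returns for a byte that was read successfully.

```c
#include <stdio.h>

/* Mimics the fgetc() result for a successfully read byte: the byte as an
   unsigned char converted to int, so always in 0 .. UCHAR_MAX. */
int as_fgetc_result(unsigned char byte)
{
    return (int)byte;
}

/* Fragile: the value of '\xFF' depends on the implementation (255 where
   char is unsigned, commonly -1 where char is signed and eight bits). */
int naive_is_ff(int ch)
{
    return ch == '\xFF';
}

/* Reports whether the naive test would mistake EOF for the byte 0xFF. */
int naive_confuses_eof(void)
{
    return naive_is_ff(EOF);
}

/* Sturdier: compare against the integer 0xFF instead of a character
   constant. EOF is negative, so it can never compare equal to 0xFF. */
int portable_is_ff(int ch)
{
    return ch == 0xFF;
}
```

Where char is signed, naive_is_ff(as_fgetc_result(0xFF)) is typically false even though the byte read was 0xFF, and naive_confuses_eof() may be true if EOF is -1.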

The only reason to make or keep char signed by default is compatibility with older compilers, for historical code that relies on this behavior and was written before the advent of signed char, some 30 years ago!

I strongly advise programmers to compile with -funsigned-char (an option accepted by GCC and Clang) to make char unsigned by default, and to use signed char, or better int8_t, where signed 8-bit variables and structure members are needed.

As hyde commented, to avoid portability problems, char values should be cast to (unsigned char) wherever the signedness of char may pose problems. For example:

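    #include <ctype.h>  /* tolower() */

    char str[] = "Hello world\n";
    /* Cast to unsigned char so tolower() never receives a negative value. */
    for (int i = 0; str[i]; i++)
        str[i] = tolower((unsigned char)str[i]);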
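The same idea can be wrapped in a reusable function. This is a sketch only; lower_in_place is a name chosen here for illustration, not something from the standard library.

```c
#include <ctype.h>
#include <stddef.h>

/* Lowercases s in place. Each char is cast to unsigned char before it is
   passed to tolower(), so a negative char value never reaches <ctype.h>.
   The result is cast back to char to store it in the string. */
void lower_in_place(char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++)
        s[i] = (char)tolower((unsigned char)s[i]);
}
```

Calling lower_in_place(str) on a modifiable array such as the str above should leave it holding "hello world\n" in the default "C" locale.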

Upvotes: 3
