Reputation: 68708
Consider a C++11 compiler that has an execution character set of UTF-8 (and is compliant with the x86-64 ABI which requires the char
type be a signed 8-bit byte).
The letter Ä (umlaut) has Unicode code point 0xC4, and has a 2 code unit UTF-8 representation of {0xC3, 0x84}.
The compiler assigns the character literal '\xC4' a type of int with a value of 0xC4.
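For reference, the two-byte encoding can be checked by hand: a code point in the range U+0080 to U+07FF is split into a 5-bit group and a 6-bit group. A minimal sketch of that arithmetic (the helper names `utf8_lead2` and `utf8_trail2` are made up for illustration and handle only the two-byte case):

```cpp
// Lead byte of a 2-byte UTF-8 sequence: 110xxxxx, carrying the top 5 bits.
// Hypothetical helper; only meaningful for code points U+0080..U+07FF.
constexpr unsigned utf8_lead2(unsigned cp) { return 0xC0u | (cp >> 6); }

// Continuation byte: 10xxxxxx, carrying the low 6 bits.
constexpr unsigned utf8_trail2(unsigned cp) { return 0x80u | (cp & 0x3Fu); }

// U+00C4 encodes as {0xC3, 0x84}, matching the representation quoted above.
static_assert(utf8_lead2(0xC4u) == 0xC3u, "lead byte of U+00C4");
static_assert(utf8_trail2(0xC4u) == 0x84u, "continuation byte of U+00C4");
```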
Is the compiler standard-compliant and ABI-compliant? What is your reasoning?
Relevant quotes from C++11 standard:
2.14.3.1
An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined value.
2.14.3.4
The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char
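One way to see what a particular compiler actually does with the literal is to inspect its type and value at compile time. This is only a sketch, and the names `literal_is_char` and `literal_value` are mine; the results depend on the implementation, which is the point of the question:

```cpp
#include <type_traits>

// True when the implementation gives '\xC4' type char, as 2.14.3p1
// describes for a literal containing a single c-char.
constexpr bool literal_is_char =
    std::is_same<decltype('\xC4'), char>::value;

// The literal's value widened to int. When 0xC4 is outside the range of
// char, this value is implementation-defined (2.14.3p4).
constexpr int literal_value = '\xC4';
```

On a compiler that behaves as described in the question, `literal_is_char` would be false and `literal_value` would be 0xC4. On an implementation with a signed 8-bit char that keeps the literal as a char, I would expect true and a negative value instead.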
Upvotes: 2
Views: 4385
Reputation: 241901
§2.14.3 paragraph 1 is undoubtedly the relevant text in the (C++11) standard. However, there were several defects in the original text, and the latest version contains the following text, emphasis added:
A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.
Although this has been accepted as a defect, it does not actually form part of any standard. However, it stands as a recommendation and I suspect that many compilers will implement it.
Upvotes: 2
Reputation: 60442
From 2.14.3p4:
The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char
x86 compilers historically (and, as you point out, that practice is now an official standard of some sort) have signed chars. \xc7 is out of range for that, so the implementation is required to document the literal value it will produce.
It looks like your implementation promotes out-of-range char literals specified with \x escapes to (in-range) integer literals.
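For comparison, here is a small sketch of what I would expect from an implementation that keeps the literal as a char. The names are made up for illustration, and the negative value is an assumption about typical signed-char implementations, not something the standard guarantees:

```cpp
#include <climits>

// Whether plain char is signed on this implementation.
constexpr bool char_is_signed = CHAR_MIN < 0;

// The literal widened to int. With a signed 8-bit char this is typically
// negative (0xC7 - 256 == -57); the exact value is implementation-defined.
constexpr int as_int = '\xc7';

// Converting through unsigned char reduces the value modulo 256, which
// recovers the byte 0xC7 on the usual 8-bit implementations.
constexpr int as_byte = static_cast<unsigned char>('\xc7');
```

If your compiler instead reports the literal as an int with value 0xC7, that is the promotion described above.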
Upvotes: 1
Reputation: 121809
You're mixing apples, oranges, pears and kumquats :)
Yes, '\xc4' is a legal character literal. Specifically, what the standard calls a "narrow-character literal".
From the C++ standard:
The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.
This might help clarify:
This might also help, if you're not familiar with it:
Here is another good, concise - and illuminating - reference:
Upvotes: 1