Reputation: 68708

C++11 character literal '\xC4' standard type with UTF-8 execution character set?

Consider a C++11 compiler that has an execution character set of UTF-8 (and is compliant with the x86-64 ABI which requires the char type be a signed 8-bit byte).

The letter Ä (A with umlaut) has the Unicode code point U+00C4 and a two-code-unit UTF-8 representation of {0xC3, 0x84}.
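To make the two-byte claim concrete, here is a minimal sketch of UTF-8 encoding for a single code point. The helper name `encode_utf8` is hypothetical (it is not from the standard or any library), and it assumes the input is a valid Unicode scalar value, so it does no error checking.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+00C4 the second branch applies: 0xC0 | (0xC4 >> 6) gives 0xC3 and 0x80 | (0xC4 & 0x3F) gives 0x84. So the single hex escape '\xC4' names one code unit value, not the UTF-8 encoding of Ä.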

The compiler assigns the character literal '\xC4' a type of int with a value of 0xC4.

Is the compiler standard-compliant and ABI-compliant? What is your reasoning?

Relevant quotes from C++11 standard:

2.14.3.1

An ordinary character literal that contains a single c-char has type char, with value equal to the numerical value of the encoding of the c-char in the execution character set. An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined value.

2.14.3.4

The escape \xhhh consists of the backslash followed by x followed by one or more hexadecimal digits that are taken to specify the value of the desired character. The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char

Upvotes: 2

Answers (3)

rici

Reputation: 241901

§2.14.3 paragraph 1 is undoubtedly the relevant text in the (C++11) standard. However, there were several defects in the original text, and the latest version contains the following text, emphasis added:

A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.

Although this has been accepted as a defect, it does not actually form part of any standard. However, it stands as a recommendation and I suspect that many compilers will implement it.

Upvotes: 2

jthill

Reputation: 60442

From 2.14.3p4:

The value of a character literal is implementation-defined if it falls outside of the implementation-defined range defined for char

x86 compilers have historically used signed chars (and, as you point out, that practice is now codified in the ABI). \xC4 is out of range for a signed char, so the implementation is required to document the literal value it will produce.

It looks like your implementation promotes out-of-range char literals specified with \x escapes to (in-range) integer literals.
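If you want to see what a particular implementation actually does, a small sketch like the following can report the literal's type and value. The function names are hypothetical and exist only for illustration; the code uses nothing beyond C++11 `decltype`, `std::is_same` and `constexpr`.

```cpp
#include <type_traits>

// Hypothetical helpers: inspect how this implementation treats '\xC4'.

// True if the literal has type char (what 2.14.3p1 says for a single c-char).
constexpr bool literal_is_char() {
    return std::is_same<decltype('\xC4'), char>::value;
}

// True if the literal has type int (the behaviour described in the question).
constexpr bool literal_is_int() {
    return std::is_same<decltype('\xC4'), int>::value;
}

// The literal's value widened to int; implementation-defined when 0xC4
// does not fit in the range of char.
constexpr int literal_value() {
    return '\xC4';
}

// The low 8 bits of the value, which are 0xC4 whether the literal is a
// negative signed char, an unsigned char, or an int holding 0xC4.
constexpr unsigned low_byte() {
    return static_cast<unsigned char>('\xC4');
}
```

Printing `literal_is_char()`, `literal_is_int()` and `literal_value()` from `main` shows which case applies. I would expect a compiler with a signed 8-bit char to report type char and a negative value, but that is an assumption about your toolchain, not something the quoted text guarantees.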

Upvotes: 1

paulsm4

Reputation: 121809

You're mixing apples, oranges, pears and kumquats :)

Yes, '\xc4' is a legal character literal. Specifically, it is what the standard calls a "narrow character literal".

From the C++ standard:

The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.

This might help clarify:

Rules for C++ string literals escape character

This might also help, if you're not familiar with it:

The absolute minimum every software developer should know about Unicode

Here is another good, concise - and illuminating - reference:

IBM Developerworks: Character literals

Upvotes: 1
