John Gordon

Reputation: 2704

Printing universal characters

Can anyone explain why universal character literals (e.g. "\u00b1") are encoded into char strings as UTF-8? Why does the following print the plus/minus symbol?

#include <iostream>
#include <cstring>
int main()
{
  std::cout << "\u00b1" << std::endl;
  return 0;
}

Is this related to my current locale?

Upvotes: 3

Views: 3157

Answers (4)

hamstergene

Reputation: 24439

String literals, e.g. "abcdef", are simple byte arrays (of type const char[]). The compiler encodes any non-ASCII characters in them in an implementation-defined way. Reportedly, Visual C++ uses the current Windows ANSI codepage, and GCC uses UTF-8, so you're probably on GCC :)

So \uABCD is interpreted by the compiler at compile time and converted into the corresponding value in that encoding, which means it can put one or more bytes into the byte array:

sizeof("\uFE58z") == 3 // Visual C++ 2010
sizeof("\uFE58z") == 5 // GCC 4.4 (MinGW)
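To see exactly which bytes the compiler stored, you can dump the array yourself. A minimal sketch (the c2 b1 output assumes GCC's UTF-8 narrow encoding):

#include <cstdio>
#include <cstddef>

int main()
{
  // Print every byte the compiler placed in the literal,
  // including the terminating NUL.
  const char s[] = "\u00b1";
  for (std::size_t i = 0; i < sizeof s; ++i)
    std::printf("%02x ", static_cast<unsigned char>(s[i]));
  std::printf("\n"); // "c2 b1 00" under a UTF-8 execution charset
  return 0;
}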

How cout prints that byte array, however, depends on the locale settings. You can change a stream's locale via a std::ios_base::imbue() call.
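For example, a sketch of imbuing a named locale (the locale name "en_US.UTF-8" is an assumption and may be unavailable on your system; whether it changes how these bytes reach the terminal is itself implementation-dependent):

#include <iostream>
#include <locale>

int main()
{
  // std::locale's constructor throws std::runtime_error if the
  // named locale is not available on this system.
  std::cout.imbue(std::locale("en_US.UTF-8"));
  std::cout << "\u00b1" << std::endl;
  return 0;
}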

Upvotes: 1

Alexandre C.

Reputation: 56976

2.13.2. [...]

5/ A universal-character-name is translated to the encoding, in the execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation defined encoding. [Note: in translation phase 1, a universal-character-name is introduced whenever an actual extended character is encountered in the source text. Therefore, all extended characters are described in terms of universal-character-names. However, the actual compiler implementation may use its own native character set, so long as the same results are obtained. ]

and

2.2. [...] The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific.

In short, the answer to your question is in your compiler documentation. However:

2.2/2: The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal character name is less than 0x20 or in the range 0x7F-0x9F (inclusive), or if the universal character name designates a character in the basic source character set, then the program is ill-formed.

so you are guaranteed that the character you name is translated into an implementation-defined encoding, possibly a locale-specific one.
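If your compiler supports C++11, the u8 prefix takes the implementation-defined part out of the picture by pinning the literal to UTF-8. A sketch assuming a C++11/14/17 compiler (in C++20 the element type of u8 literals changes to char8_t, so the stream insertion below no longer compiles):

#include <iostream>

int main()
{
  // u8 string literals are guaranteed UTF-8, so U+00B1 occupies
  // exactly two bytes plus the terminating NUL.
  static_assert(sizeof(u8"\u00b1") == 3, "two UTF-8 bytes plus NUL");
  std::cout << u8"\u00b1" << std::endl;
  return 0;
}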

Upvotes: 4

Harold Sota

Reputation: 7566

C++ Character Sets

With the standardization of C++, it's useful to review some of the mechanisms included in the language for dealing with character sets. This might seem like a very simple issue, but there are some complexities to contend with.

The first idea to consider is the notion of a "basic source character set" in C++. This is defined to be:

    all ASCII printing characters (octal 041 through 0177), save for @ $ ` and DEL

    space

    horizontal tab

    vertical tab

    form feed

    newline

or 96 characters in all. These are the characters used to compose a C++ source program.

Some national character sets, such as the European ISO 646 variants, use some of these character positions for other letters. The affected ASCII characters are:

    [ ] { } | \

To get around this problem, C++ defines trigraph sequences that can be used to represent these characters:

    [       ??(

    ]       ??)

    {       ??<

    }       ??>

    |       ??!

    \       ??/

    #       ??=

    ^       ??'

    ~       ??-

Trigraph sequences are mapped to the corresponding basic source character early in the compilation process.
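For illustration, a sketch using trigraphs (an assumption about your toolchain: GCC needs the -trigraphs flag, and trigraphs were removed from the language in C++17):

??=include <cstdio>

int main()
{
  // ??( and ??) stand for [ and ]; ??/ stands for backslash,
  // so "??/n" is the escape sequence \n.
  char msg??(??) = "trigraphs??/n";
  std::printf("%s", msg);
  return 0;
}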

C++ also has the notion of "alternative tokens" that can be used in place of certain standard tokens; a compilable sketch follows the list. The tokens and their alternatives are:

    {       <%

    }       %>

    [       <:

    ]       :>

    #       %:

    ##      %:%:

    &&      and

    |       bitor

    ||      or

    ^       xor

    ~       compl

    &       bitand

    &=      and_eq

    |=      or_eq

    ^=      xor_eq

    !       not

    !=      not_eq
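A sketch mixing digraphs and the word-form operators (all standard C++, though rarely seen in real code):

%:include <iostream>

int main()
<%
  // <% %> stand for { }, <: :> for [ ], %: for #,
  // and "and"/"not_eq" replace && and !=.
  int v<:2:> = <% 1, 2 %>;
  if (v<:0:> == 1 and v<:1:> not_eq 3)
    std::cout << "alternative tokens at work" << std::endl;
  return 0;
%>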

Another idea is the "basic execution character set". This includes all of the basic source character set, plus control characters for alert, backspace, carriage return, and null. The "execution character set" is the basic execution character set plus additional implementation-defined characters. The idea is that a source character set is used to define a C++ program itself, while an execution character set is used when a C++ application is executing.

Given this notion, it's possible to manipulate additional characters in a running program, for example characters from Cyrillic or Greek. Character constants can be expressed using any of:

    \137            octal

    \xabcd          hexadecimal

    \U12345678      universal character name (ISO/IEC 10646)

    \u1234          shorthand for \U00001234

This notation uses the source character set to define execution set characters. Universal character names can be used in identifiers (if they designate letters) and in character literals (see the sketch after these examples):

    '\u1234'

    L'\u2345'
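For instance, a sketch printing a wide literal named via a universal character name (the Cyrillic letter Zhe, U+0416; imbuing the native "" locale and the terminal's ability to render the glyph are assumptions about the environment):

#include <iostream>
#include <locale>

int main()
{
  // Switch wcout to the user's native locale so the wide character
  // can be converted for output; without this the write may fail.
  std::wcout.imbue(std::locale(""));
  wchar_t zhe = L'\u0416';
  std::wcout << zhe << std::endl;
  return 0;
}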

The above features may not yet exist in your local C++ compiler. They are important to consider when developing internationalized applications.

Upvotes: 0

user195488

Reputation:

\u00b1 is the ± symbol: that is its Unicode code point, regardless of locale.

Your code at ideone, see here.

Upvotes: 1
