Reputation: 2704
Can anyone explain why universal character names (e.g. "\u00b1") are being encoded into char strings as UTF-8? Why does the following print the plus/minus symbol?
#include <iostream>
#include <cstring>
int main()
{
std::cout << "\u00b1" << std::endl;
return 0;
}
Is this related to my current locale?
Upvotes: 3
Views: 3157
Reputation: 24439
String literals, e.g. "abcdef", are simple byte arrays (of type const char[]). The compiler encodes non-ASCII characters in them into something that is implementation-defined. Rumor has it that Visual C++ uses the current Windows ANSI code page and GCC uses UTF-8, so you're probably on GCC :)
So \uABCD is interpreted by the compiler at compile time and converted into the corresponding value in that encoding. That is, it can put one or more bytes into the byte array:
sizeof("\uFE58z") == 3 // visual C++ 2010
sizeof("\uFE58z") == 5 // gcc 4.4 mingw
How cout prints that byte array, in turn, depends on the locale settings. You can change a stream's locale via a std::ios_base::imbue() call.
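For instance, a sketch of switching std::cout to the user's preferred locale (note that imbuing a narrow stream affects formatting of numbers and the like; it does not transcode the bytes you insert):
#include <iostream>
#include <locale>
int main()
{
    // "" selects the environment's default locale rather than the "C" locale.
    std::cout.imbue(std::locale(""));
    std::cout << "\u00b1" << std::endl;
    return 0;
}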
Upvotes: 1
Reputation: 56976
2.13.2. [...]
5/ A universal-character-name is translated to the encoding, in the execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding. [Note: in translation phase 1, a universal-character-name is introduced whenever an actual extended character is encountered in the source text. Therefore, all extended characters are described in terms of universal-character-names. However, the actual compiler implementation may use its own native character set, so long as the same results are obtained. ]
and
2.2. [...] The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific.
In short, the answer to your question is in your compiler documentation. However:
2.2. 2/ The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal character name is less than 0x20 or in the range 0x7F-0x9F (inclusive), or if the universal character name designates a character in the basic source character set, then the program is ill-formed.
so you are guaranteed that the character you name is translated into an implementation-defined encoding, possibly locale-specific.
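If you want to test whether that implementation-defined encoding happens to be UTF-8, here is one sketch (assuming a C++11 compiler, where u8 string literals are guaranteed to be UTF-8-encoded and have type const char[]):
#include <cstring>
#include <iostream>
int main()
{
    const char narrow[] = "\u00b1";    // implementation-defined encoding
    const char utf8[]   = u8"\u00b1";  // UTF-8 by definition in C++11
    // Prints "true" on implementations (like GCC) whose narrow execution
    // encoding is UTF-8, "false" otherwise.
    std::cout << std::boolalpha
              << (sizeof narrow == sizeof utf8
                  && std::memcmp(narrow, utf8, sizeof narrow) == 0)
              << std::endl;
    return 0;
}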
Upvotes: 4
Reputation: 7566
C++ Character Sets
With the standardization of C++, it's useful to review some of the mechanisms included in the language for dealing with character sets. This might seem like a very simple issue, but there are some complexities to contend with.
The first idea to consider is the notion of a "basic source character set" in C++. This is defined to be:
all ASCII printing characters (octal 041 - 0177), save for @, $, `, and DEL
space
horizontal tab
vertical tab
form feed
newline
or 96 characters in all. These are the characters used to compose a C++ source program.
Some national character sets, such as the European ISO 646 variants, use some of these character positions for other letters. The ASCII characters so affected are:
# [ ] { } \ ^ | ~
To get around this problem, C++ defines trigraph sequences that can be used to represent these characters:
[ ??(
] ??)
{ ??<
} ??>
| ??!
\ ??/
# ??=
^ ??'
~ ??-
Trigraph sequences are mapped to the corresponding basic source character early in the compilation process.
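A sketch of trigraphs in action (trigraph replacement must be enabled; GCC, for example, only performs it with -trigraphs or a strict mode such as -std=c++98, and trigraphs were removed entirely in C++17):
#include <iostream>
int main()
{
    int a??(3??) = ??< 1, 2, 3 ??>;      // becomes: int a[3] = { 1, 2, 3 };
    std::cout << a??(1??) << std::endl;  // becomes: a[1], so this prints 2
    return 0;
}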
C++ also has the notion of "alternative tokens" that can be used in place of certain standard tokens (a usage sketch follows the list). The tokens and their alternatives are:
{ <%
} %>
[ <:
] :>
# %:
## %:%:
&& and
| bitor
|| or
^ xor
~ compl
& bitand
&= and_eq
|= or_eq
^= xor_eq
! not
!= not_eq
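A short sketch using a few of the alternatives (in C++ these are keywords, so no header is required, unlike C's <iso646.h>):
#include <iostream>
int main()
{
    int x = 6;
    // "and", "not", and "bitand" stand for &&, !, and & respectively.
    if (x > 0 and not (x bitand 1))
        std::cout << "positive and even" << std::endl;
    return 0;
}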
Another idea is the "basic execution character set". This includes all of the basic source character set, plus control characters for alert, backspace, carriage return, and null. The "execution character set" is the basic execution character set plus additional implementation-defined characters. The idea is that a source character set is used to define a C++ program itself, while an execution character set is used when a C++ application is executing.
Given this notion, it's possible to manipulate additional characters in a running program, for example characters from Cyrillic or Greek. Character constants can be expressed using any of:
\137 octal
\xabcd hexadecimal
\U12345678 universal character name (ISO/IEC 10646)
\u1234 is shorthand for \U00001234
This notation uses the source character set to define execution set characters. Universal character names can be used in identifiers (if letters) and in character literals:
'\u1234'
L'\u2345'
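Putting the forms together in one runnable sketch (the octal escape names 'A' and the hex escape names 'B' by code value; the universal character name is the plus/minus sign from the question):
#include <iostream>
int main()
{
    std::cout << '\101' << '\x42' << std::endl;  // octal 101 = 'A', hex 42 = 'B'
    std::cout << "\u00b1" << std::endl;          // U+00B1, encoded per the implementation
    return 0;
}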
The above features may not yet exist in your local C++ compiler. They are important to consider when developing internationalized applications.
Upvotes: 0
Reputation:
\u00b1 is the ± symbol, as that is the correct Unicode representation regardless of locale.
Your code is on ideone, see here.
Upvotes: 1