Reputation: 8129
To encode Unicode/UTF-8 characters in my program, I've been using the \uXXXX
escape sequences, such as:
wchar_t superscript_4 = L'\u2074'; // U+2074 SUPERSCRIPT 4 '⁴'
wchar_t subscript_4 = L'\u2084'; // U+2084 SUBSCRIPT 4 '₄'
However, plain hexadecimal constants should work just as well, since Unicode code points are conventionally written in hexadecimal:
wchar_t superscript_4 = 0x2074;
wchar_t subscript_4 = 0x2084;
Will the second example encode the character properly? Will I run into wide-char issues, segmentation faults, or incorrectly stored character values? If so, why? If not, why?
Upvotes: 0
Views: 1264
Reputation: 7443
You could initialize them with hex constants, but you could also initialize normal chars with numeric constants, e.g. char c = 67;. It works the same way: it assigns whatever char or wchar_t has the value of that int. In the example you give, and assuming a Unicode execution environment (not quite guaranteed but highly probable), it’s superscript or subscript 4; in my example it’s a capital C.
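As a rough illustration of the char case, here is a minimal sketch. It assumes an ASCII-compatible execution character set, where 'C' has the value 67; the function name is made up for this example.

```c
/* Hypothetical helper: returns 1 when the numeric constant 67 and the
   literal 'C' store the same value in a char.
   Assumes an ASCII-compatible execution character set. */
int numeric_matches_literal(void)
{
    char from_number  = 67;   /* plain integer constant */
    char from_literal = 'C';  /* character constant */
    return from_number == from_literal;
}
```

On an EBCDIC system the two would differ, which is one reason the literal is the more portable spelling.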
In particular, for regular chars, technically (in C; in C++ they have type char) character constants like 'C' have type int, and you are usually assigning int values to chars. For wchar_ts, the constants do actually have wchar_t type, and the integral value is the same value you’d get by calling mbtowc. So assuming you’re working in a Unicode environment, the hex constants are equivalent to the Unicode escapes.
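To make the equivalence concrete, something like the following should hold. This is only a sketch: it assumes a compiler that accepts \u escapes in wide character constants (C99 or later) and a wchar_t that stores Unicode code points, and the function name is invented for illustration.

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 when the \u escapes and the hex
   constants from the question produce identical wchar_t values.
   Assumes wchar_t holds Unicode code points on this platform. */
int hex_matches_escape(void)
{
    wchar_t sup_escape = L'\u2074';  /* universal character name */
    wchar_t sup_hex    = 0x2074;     /* plain hex constant */
    wchar_t sub_escape = L'\u2084';
    wchar_t sub_hex    = 0x2084;
    return sup_escape == sup_hex && sub_escape == sub_hex;
}
```

Either way the variable ends up holding an ordinary integer value, so neither form can cause a segmentation fault by itself.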
Usually you don’t want to do this, though; using character literals makes it clearer what your intention is. This is especially true if you use non-ASCII characters in your source code, in which case you can make the code just be
wchar_t superscript_4 = L'⁴';
wchar_t subscript_4 = L'₄';
Also note that for many purposes it’s better to use char16_t or char32_t, because wchar_t can have different widths on different platforms; it might also be cleaner to just use UTF-8 until you have a specific need to switch to something else.
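For comparison, here is a minimal sketch of the fixed-width and UTF-8 alternatives. It assumes a C11 compiler and a libc that ships <uchar.h> (not every platform does), and that u and U literals are encoded as UTF-16 and UTF-32 as they are with GCC and Clang. Both function names are hypothetical.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: returns 1 when the char16_t and char32_t
   literals for U+2074 both equal the hex code point. */
int fixed_width_matches_hex(void)
{
    char16_t c16 = u'\u2074';  /* one UTF-16 code unit (BMP character) */
    char32_t c32 = U'\u2074';  /* one UTF-32 code unit */
    return c16 == 0x2074 && c32 == 0x2074;
}

/* Hypothetical helper: number of bytes U+2074 occupies in UTF-8
   (the bytes are E2 81 B4). */
size_t utf8_length_of_superscript_4(void)
{
    static const unsigned char utf8[] = u8"\u2074";
    return sizeof utf8 - 1;  /* exclude the terminating null byte */
}
```

In UTF-8 the character is a three-byte sequence rather than a single value, so it needs a string or byte array instead of a single char.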
Upvotes: 1