Unicode escape sequences vs hexadecimal values

Question

To encode Unicode/UTF-8 characters in my program, I've been using the \uXXXX escape sequences, such as:

wchar_t superscript_4 = L'\u2074';  // U+2074 SUPERSCRIPT 4 '⁴'
wchar_t subscript_4   = L'\u2084';  // U+2084 SUBSCRIPT 4 '₄'

However, it seems that plain hexadecimal values should work just as well, since Unicode code points are conventionally written in hexadecimal:

wchar_t superscript_4 = 0x2074;
wchar_t subscript_4   = 0x2084;

Will the second example encode the character properly? Will I run into wide-char issues, segmentation faults, or incorrectly stored character values? If so, why? If not, why?

Daniel H · Accepted Answer

You can initialize them with hex constants, just as you can initialize ordinary chars with numeric constants, e.g. char c = 67;. It works the same way: the variable gets whichever char or wchar_t has that integer value. In the example you give, and assuming a Unicode execution environment (not quite guaranteed but highly probable), that is superscript or subscript 4; in my example it is a capital C.

In particular, for regular chars in C, character constants like 'C' technically have type int, so you are usually assigning int values to chars anyway (in C++ they have type char). For wchar_t, the constants do actually have type wchar_t, and the integral value is the same value you’d get by calling mbtowc. So assuming you’re working in a Unicode environment, the hex constants are equivalent to the Unicode escapes.
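As a rough illustration of that equivalence, here is a minimal C++11 sketch that checks it at compile time. It assumes a Unicode execution environment (true for mainstream GCC, Clang, and MSVC setups, but not guaranteed by the standard), and same_value is a hypothetical helper introduced only for this example:

```cpp
#include <type_traits>

// A wide character literal already has type wchar_t, so no conversion is involved.
static_assert(std::is_same<decltype(L'\u2074'), wchar_t>::value,
              "wide character literals have type wchar_t");

// In C++ a plain character literal has type char (in C it would be int).
static_assert(std::is_same<decltype('C'), char>::value,
              "plain character literals have type char in C++");

// Hypothetical helper: true when a wide character and an integer constant
// hold the same numeric value.
constexpr bool same_value(wchar_t from_literal, unsigned long from_hex) {
    return static_cast<unsigned long>(from_literal) == from_hex;
}

// ASSUMPTION: the wide execution character set is Unicode. On such a
// platform the escape and the hex constant are the same value.
static_assert(same_value(L'\u2074', 0x2074), "U+2074 SUPERSCRIPT FOUR");
static_assert(same_value(L'\u2084', 0x2084), "U+2084 SUBSCRIPT FOUR");
```

If any of these assertions fails to compile, the platform does not use Unicode values for wchar_t, and the hex constants would not name the intended characters there.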

Usually you don’t want to do this, though; using character literals makes your intention clearer. If your toolchain accepts non-ASCII characters in source code, you can go a step further and write the characters directly:

wchar_t superscript_4 = L'⁴';
wchar_t subscript_4   = L'₄';

Also note that for many purposes it’s better to use char16_t or char32_t, because wchar_t can have different widths on different platforms; it might also be cleaner to just use UTF-8 until you have a specific need to switch to something else.
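To make the width point concrete, here is a small C++11 sketch using the fixed-width character types. The utf8_length helper is hypothetical and exists only for illustration; it reports how many bytes a code point would occupy in UTF-8 and does not validate its input:

```cpp
#include <type_traits>

// U'...' literals have type char32_t, which can hold any single code point.
constexpr char32_t superscript_4 = U'\u2074';
// u'...' literals have type char16_t; U+2084 fits in one UTF-16 code unit.
// Code points above U+FFFF would need two char16_t units (a surrogate pair).
constexpr char16_t subscript_4 = u'\u2084';

static_assert(std::is_same<decltype(U'\u2074'), char32_t>::value,
              "U'' literals have type char32_t");
static_assert(superscript_4 == 0x2074, "value is the code point itself");
static_assert(subscript_4 == 0x2084, "value is the code point itself");

// Hypothetical helper: number of bytes needed to encode a code point in UTF-8.
constexpr int utf8_length(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// U+2074 takes three bytes in UTF-8 (E2 81 B4); the string literal adds a
// terminating null, hence a size of four.
static_assert(utf8_length(U'\u2074') == 3, "three UTF-8 bytes");
static_assert(sizeof(u8"\u2074") == 4, "three bytes plus the terminator");
```

By contrast, sizeof(wchar_t) is typically 2 on Windows and 4 on Linux, which is why code that relies on wchar_t holding a whole code point is not portable.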
