Marco Scannadinari

Reputation: 1874

UTF-8 escape sequence in C string literal

In C, I specify a Unicode character with the form:

"\uCODEPOINT"

However, I can't find any details on how that is stored. Is it UTF-8, UTF-16, or UTF-32? Is there a notation to specify UTF-8 encoding, or do I have to write each byte in hexadecimal?

Upvotes: 1

Views: 5001

Answers (2)

Jim Balter

Reputation: 16406

\uXXXX is a (short form) universal character name. You can use, say, \u00E9 in your program in place of é: in the source text, e.g., as part of an identifier, or in a character or string literal. If you use it in a literal, the result is exactly the same as if you had typed é in that literal. (C does not allow universal character names for most of the basic character set: values below 00A0 other than 0024, 0040 and 0060 are not permitted, so \u0041 is not a valid way to write A.) For any such character you can use the universal name, or you can enter the character directly if you have an input method that allows you to. How the character is encoded in memory is implementation-defined; it depends on whether the character appears in a "" or L"" literal and on whether the character is a member of the execution character set. Note this from the C standard:

Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.

In an implementation that uses UTF-8 to represent non-wide strings, \uXXXX appearing in a non-wide string literal will of course be encoded in UTF-8, along with all the other characters in the literal. If the \uXXXX occurs in a wide string literal, it will be encoded as a wide character with value 0xXXXX, provided the implementation's wchar_t holds Unicode code points (which it signals by defining __STDC_ISO_10646__).
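To see what your own compiler does with a narrow literal, you can dump its bytes. The following is only a rough sketch: hex_dump_string is a helper name made up for this illustration, not anything from the standard library.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not a standard function): writes the bytes of s
   into out as space-separated uppercase hex, e.g. "C3 A9".
   Returns the number of bytes in s, not counting the terminating null. */
size_t hex_dump_string(const char *s, char *out, size_t out_size)
{
    size_t n = 0;     /* bytes seen in s */
    size_t used = 0;  /* characters written to out */

    if (out_size > 0)
        out[0] = '\0';

    for (; s[n] != '\0'; n++) {
        /* Each byte needs at most 3 characters (separator + 2 hex digits)
           plus room for the terminator; stop writing when out is full. */
        if (used + 4 <= out_size) {
            used += (size_t)snprintf(out + used, out_size - used,
                                     n == 0 ? "%02X" : " %02X",
                                     (unsigned)(unsigned char)s[n]);
        }
    }
    return n;
}
```

Calling hex_dump_string("\u00E9", buf, sizeof buf) from main and printing buf should give C3 A9 when the execution character set is UTF-8, which as far as I know is the default for GCC and Clang. With a different execution character set the bytes will differ, which is exactly the implementation dependence described above.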

Upvotes: 3

ldav1s

Reputation: 16305

However, I can't find any details on how that is stored.

The execution character set is implementation-defined. However, some compilers have options to change it if the default is not what you want (GCC's -fexec-charset, for example). The C11 standard also adds string literal prefixes that select a Unicode encoding explicitly (e.g. u8"Hello" for UTF-8).
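As a small sketch of the C11 prefixes, here is the same code point (U+20AC, the euro sign) written with u8, u and U. The array and function names are made up for this example, and it assumes a C11 compiler that ships <uchar.h>. Only u8 is guaranteed to be UTF-8; u and U are UTF-16 and UTF-32 when the implementation defines __STDC_UTF_16__ and __STDC_UTF_32__.

```c
#include <stddef.h>
#include <uchar.h>   /* char16_t, char32_t (C11) */

/* U+20AC spelled with the three C11 Unicode prefixes. */
static const char     utf8_text[]  = u8"\u20AC"; /* UTF-8 code units */
static const char16_t utf16_text[] = u"\u20AC";  /* UTF-16 if __STDC_UTF_16__ */
static const char32_t utf32_text[] = U"\u20AC";  /* UTF-32 if __STDC_UTF_32__ */

/* Number of code units in each literal, not counting the terminator. */
size_t utf8_units(void)
{
    return sizeof utf8_text / sizeof utf8_text[0] - 1;
}

size_t utf16_units(void)
{
    return sizeof utf16_text / sizeof utf16_text[0] - 1;
}

size_t utf32_units(void)
{
    return sizeof utf32_text / sizeof utf32_text[0] - 1;
}
```

With the u8 prefix the literal holds the three bytes E2 82 AC regardless of the execution character set, so you do not have to write each byte in hexadecimal yourself. The u and U forms each hold a single code unit for this code point.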

Upvotes: 1
