Reputation: 882
I'm working on a string unescaping function that converts literal sequences like \uxxxx (where xxxx is a hex value) into bytes of the corresponding value. I am planning to have the function take the first two characters of the xxxx sequence, calculate the byte value, and do the same with the second pair.
But I ran into an unexpected result with literal typed UTF-8 characters. The following illustrates my issue:
#include <stdio.h>

int main()
{
    unsigned char *str1 = "abcĢ";
    unsigned char *str2 = "abc\x01\x22";

    for (unsigned i = 0; i < 5; i++)
        printf ("String 1 character #%u: %x\n", i, str1[i]);

    for (unsigned i = 0; i < 5; i++)
        printf ("String 2 character #%u: %x\n", i, str2[i]);

    return 0;
}
Output:
String 1 character #0: 61
String 1 character #1: 62
String 1 character #2: 63
String 1 character #3: c4
String 1 character #4: a2
String 2 character #0: 61
String 2 character #1: 62
String 2 character #2: 63
String 2 character #3: 1
String 2 character #4: 22
Unicode character Ģ has a hex value of \x0122, so I would expect bytes #3 and #4 to be \x01 and \x22 respectively.
Where do c4 and a2 come from? I guess I am not understanding how multi-byte characters in strings are encoded in C. Any help would be appreciated.
Upvotes: 1
Views: 642
Reputation: 308402
UTF-8 can't work in a simplistic way of breaking a large value into individual bytes, because it would be ambiguous. How would you tell the difference between "\u4142" (䅂) and the two-character string "AB"?
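For a concrete illustration: in UTF-8 the single codepoint U+4142 encodes as the three bytes E4 85 82, while the two-character string "AB" is just the two bytes 41 42, so a decoder reading the byte stream can always tell them apart.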
The rules for producing a UTF-8 byte string from a Unicode code point number are quite simple, and eliminate the ambiguity. Given any sequence of byte values, it either defines unambiguous codepoints or it's an invalid sequence.
Here's a simple function that will convert a single Unicode codepoint value to a UTF-8 byte sequence.
void codepoint_to_UTF8(int codepoint, char *out)
/* out must point to a buffer of at least 5 chars. */
{
    if (codepoint <= 0x7f)
        *out++ = (char)codepoint;                            /* 1 byte: 0xxxxxxx */
    else if (codepoint <= 0x7ff)
    {
        /* 2 bytes: 110xxxxx 10xxxxxx */
        *out++ = (char)(0xc0 | ((codepoint >> 6) & 0x1f));
        *out++ = (char)(0x80 | (codepoint & 0x3f));
    }
    else if (codepoint <= 0xffff)
    {
        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        *out++ = (char)(0xe0 | ((codepoint >> 12) & 0x0f));
        *out++ = (char)(0x80 | ((codepoint >> 6) & 0x3f));
        *out++ = (char)(0x80 | (codepoint & 0x3f));
    }
    else
    {
        /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        *out++ = (char)(0xf0 | ((codepoint >> 18) & 0x07));
        *out++ = (char)(0x80 | ((codepoint >> 12) & 0x3f));
        *out++ = (char)(0x80 | ((codepoint >> 6) & 0x3f));
        *out++ = (char)(0x80 | (codepoint & 0x3f));
    }
    *out = 0;   /* NUL-terminate the output */
}
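For example, feeding it the codepoint from the question reproduces the bytes you saw (buf here is just an illustrative name):

    char buf[5];
    codepoint_to_UTF8(0x0122, buf);   /* buf now holds 0xC4 0xA2 followed by '\0' */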
Note that this function does no error checking, so if you give it an input outside the valid Unicode range of 0 to 0x10ffff it will still produce output that follows the UTF-8 bit patterns, but it will be an incorrect encoding and may not be legal UTF-8 at all.
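If you want to reject such inputs first, a minimal sketch of a range check might look like this (is_valid_codepoint is just an illustrative name; whether to also reject the UTF-16 surrogate range D800-DFFF is an extra assumption here):

    int is_valid_codepoint(int codepoint)
    {
        if (codepoint < 0 || codepoint > 0x10FFFF)
            return 0;   /* outside the Unicode range */
        if (codepoint >= 0xD800 && codepoint <= 0xDFFF)
            return 0;   /* UTF-16 surrogates are not encodable as UTF-8 */
        return 1;
    }

Call it before codepoint_to_UTF8 and handle a failure however suits your unescaper.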
Upvotes: 1
Reputation: 597166
Unicode character Ģ has a hex value of \x0122, so I would expect bytes #3 and #4 to be \x01 and \x22 respectively. Where do c4 and a2 come from?
In Unicode, Ģ is codepoint U+0122 LATIN CAPITAL LETTER G WITH CEDILLA, which in UTF-8 is encoded as bytes 0xC4 0xA2.
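Working that out by hand: 0x0122 is 1 0010 0010 in binary, too big for the single-byte form, so the two-byte pattern 110xxxxx 10xxxxxx is used. The top five of its eleven payload bits are 00100, giving 11000100 = 0xC4, and the bottom six are 100010, giving 10100010 = 0xA2.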
Either your source file is saved as UTF-8, or your compiler is configured to save string literals in UTF-8. Either way, in your str1 string, the literal Ģ is stored as UTF-8. Thus:
unsigned char *str1 = "abcĢ";
is roughly equivalent to this:
unsigned char literal[] = {'a', 'b', 'c', 0xC4, 0xA2, '\0'};
unsigned char *str1 = &literal[0];
In an escape sequence, the entire sequence represents a single numeric value. So, \x01 and \x22 represent the individual numeric values 0x01 hex (1 decimal) and 0x22 hex (34 decimal), respectively. Thus:
unsigned char *str2 = "abc\x01\x22";
is roughly equivalent to this:
unsigned char literal[] = {'a', 'b', 'c', 0x01, 0x22, '\0'};
unsigned char *str2 = &literal[0];
You are simply outputting the raw bytes of the strings that str1 and str2 are pointing at.
The escape sequence \u0122 represents the numeric value 0x0122 hex (290 decimal), which in Unicode is codepoint U+0122, hence C4 A2 in UTF-8. So, if you have an input string like this:
const char *str = "abc\\u0122"; // {'a', 'b', 'c', '\', 'u', '0', '1', '2', '2', '\0'}
And you want to decode it to UTF-8, you would need to detect the "\u" prefix, extract the following "0122" substring, parse it as a hex number into an integer, interpret that integer as a Unicode codepoint, and convert it to UTF-8 (a, b, and c are already valid chars as-is in UTF-8).
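As a minimal sketch of that decoding step (not a complete implementation): the loop below assumes every "\u" is followed by exactly four hex digits, handles no other escapes, does no error checking, and reuses the codepoint_to_UTF8() function from the other answer; unescape_to_UTF8 is just an illustrative name.

    #include <stdlib.h>
    #include <string.h>

    void codepoint_to_UTF8(int codepoint, char *out);  /* from the other answer */

    void unescape_to_UTF8(const char *in, char *out)   /* out must be large enough */
    {
        while (*in)
        {
            if (in[0] == '\\' && in[1] == 'u')
            {
                char hex[5] = {0};
                memcpy(hex, in + 2, 4);                       /* grab the four hex digits */
                int codepoint = (int)strtol(hex, NULL, 16);   /* parse them as a hex number */
                codepoint_to_UTF8(codepoint, out);            /* append 1-4 UTF-8 bytes + '\0' */
                out += strlen(out);
                in += 6;                                      /* skip over "\uXXXX" */
            }
            else
            {
                *out++ = *in++;   /* plain ASCII bytes are already valid UTF-8 */
            }
        }
        *out = '\0';
    }

With the example string above, unescape_to_UTF8("abc\\u0122", buf) would leave buf holding the bytes 61 62 63 C4 A2 00, i.e. the same "abcĢ" bytes as str1 in the question.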
Upvotes: 3