user3758232

Reputation: 882

Numeric value of literal UTF-8 characters

I'm working on a string unescaping function that converts literal sequences like \uxxxx (where xxxx is a hex value) into bytes of the corresponding value. I am planning to have the function take the first two characters of the xxxx sequence, calculate the byte value, and do the same with the second pair of characters.

But I ran into an unexpected result with literal typed UTF-8 characters. The following illustrates my issue:

#include <stdio.h>

int main()
{
    unsigned char *str1 = "abcĢ";
    unsigned char *str2 = "abc\x01\x22";
    for (unsigned i = 0; i < 5; i++)
        printf ("String 1 character #%u: %x\n", i, str1[i]);
    for (unsigned i = 0; i < 5; i++)
        printf ("String 2 character #%u: %x\n", i, str2[i]);

    return 0;
}

Output:

String 1 character #0: 61
String 1 character #1: 62
String 1 character #2: 63
String 1 character #3: c4
String 1 character #4: a2
String 2 character #0: 61
String 2 character #1: 62
String 2 character #2: 63
String 2 character #3: 1
String 2 character #4: 22

Unicode character Ģ has the hex value 0x0122, so I would expect bytes #3 and #4 to be \x01 and \x22 respectively.

Where do c4 and a2 come from? I guess I am not understanding how multi-byte characters in strings are encoded in C. Any help would be appreciated.

Upvotes: 1

Views: 642

Answers (2)

Mark Ransom

Reputation: 308402

UTF-8 can't work by simply breaking a large value into its individual bytes, because the result would be ambiguous. How would you tell the difference between "\u4142" (䅂) and the two-character string "AB"?

The rules for producing a UTF-8 byte string from a Unicode code point are quite simple and eliminate that ambiguity: any sequence of byte values either decodes to an unambiguous sequence of code points or is invalid.

Here's a simple function that will convert a single Unicode codepoint value to a UTF-8 byte sequence.

void codepoint_to_UTF8(int codepoint, char * out)
/* out must point to a buffer of at least 5 chars. */
{
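    /* 1-byte sequence, U+0000..U+007F: plain ASCII, copied as-is. */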
    if (codepoint <= 0x7f)
        *out++ = (char)codepoint;
    else if (codepoint <= 0x7ff)
    {
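        /* 2-byte sequence, U+0080..U+07FF: 110xxxxx 10xxxxxx */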
        *out++ = (char)(0xc0 | ((codepoint >> 6) & 0x1f));
        *out++ = (char)(0x80 | (codepoint & 0x3f));
    }
    else if (codepoint <= 0xffff)
    {
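        /* 3-byte sequence, U+0800..U+FFFF: 1110xxxx 10xxxxxx 10xxxxxx */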
        *out++ = (char)(0xe0 | ((codepoint >> 12) & 0x0f));
        *out++ = (char)(0x80 | ((codepoint >> 6) & 0x3f));
        *out++ = (char)(0x80 | (codepoint & 0x3f));
    }
    else
    {
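        /* 4-byte sequence for anything above U+FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */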
        *out++ = (char)(0xf0 | ((codepoint >> 18) & 0x07));
        *out++ = (char)(0x80 | ((codepoint >> 12) & 0x3f));
        *out++ = (char)(0x80 | ((codepoint >> 6) & 0x3f));
        *out++ = (char)(0x80 | (codepoint & 0x3f));
    }
    *out = 0;
}

Note that this function does no error checking, so if you give it an input outside the valid Unicode range of 0 to 0x10FFFF it will happily generate a byte sequence that follows the UTF-8 bit patterns but does not correspond to any valid Unicode character.
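To tie this back to the question, here is a minimal usage sketch (my own addition, assuming the function above is compiled in the same file) that feeds it the code point of Ģ:

#include <stdio.h>

void codepoint_to_UTF8(int codepoint, char * out); /* the function above */

int main(void)
{
    char buf[5];
    codepoint_to_UTF8(0x122, buf);               /* Ģ is U+0122 */
    for (int i = 0; buf[i] != 0; i++)
        printf("%02x ", (unsigned char)buf[i]);  /* prints: c4 a2 */
    printf("\n");
    return 0;
}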

Upvotes: 1

Remy Lebeau

Reputation: 597166

Unicode character Ģ has the hex value 0x0122, so I would expect bytes #3 and #4 to be \x01 and \x22 respectively.

Where do c4 and a2 come from?

In Unicode, Ģ is codepoint U+0122 LATIN CAPITAL LETTER G WITH CEDILLA, which in UTF-8 is encoded as bytes 0xC4 0xA2.
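(You can see where those bytes come from by applying the UTF-8 rules by hand: 0x0122 is binary 1 0010 0010; the low six bits 10 0010 go into the trailing byte 10 100010 = 0xA2, and the remaining bits 00100 go into the leading byte 110 00100 = 0xC4.)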

Either your source file is saved as UTF-8, or your compiler is configured to encode string literals as UTF-8. Either way, the literal Ģ in your str1 string is stored as UTF-8 bytes. Thus:

unsigned char *str1 = "abcĢ";

is roughly equivalent to this:

unsigned char literal[] = {'a', 'b', 'c', 0xC4, 0xA2, '\0'};
unsigned char *str1 = &literal[0];

In an escape sequence, the entire sequence represents a single numeric value. So, \x01 and \x22 represent the individual numeric values 0x01 hex (1 decimal) and 0x22 hex (34 decimal), respectively. Thus:

unsigned char *str2 = "abc\x01\x22";

is roughly equivalent to this:

unsigned char literal[] = {'a', 'b', 'c', 0x01, 0x22, '\0'};
unsigned char *str2 = &literal[0];

You are simply outputting the raw bytes of the strings that str1 and str2 are pointing at.

The escape sequence \u0122 represents the numeric value 0x0122 hex (290 decimal), which in Unicode is codepoint U+0122, hence C4 A2 in UTF-8. So, if you have an input string like this:

const char *str = "abc\\u0122"; // {'a', 'b', 'c', '\', 'u', '0', '1', '2', '2', '\0'}

And you want to decode it to UTF-8, you would need to detect the "\u" prefix, extract the following "0122" substring, parse it as a hex number into an integer, interpret that integer as a Unicode codepoint, and convert it to UTF-8 (a, b, and c are already valid chars as-is in UTF-8).
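For illustration only, here is one way that decoding loop might look. The helper names unescape_u and encode_utf8 are made up for this sketch, error checking is minimal, and only code points up to U+FFFF are handled (matching the 4-hex-digit \u form):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: encode one code point (up to U+FFFF here) as UTF-8,
   returning the number of bytes written. Same branching as the function in
   the other answer. */
static size_t encode_utf8(unsigned int cp, char *out)
{
    if (cp <= 0x7f) { out[0] = (char)cp; return 1; }
    if (cp <= 0x7ff) {
        out[0] = (char)(0xc0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3f));
        return 2;
    }
    out[0] = (char)(0xe0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3f));
    out[2] = (char)(0x80 | (cp & 0x3f));
    return 3;
}

/* Hypothetical helper: expand \uXXXX escapes in src into dst.
   dst must be large enough (the output is never longer than the input). */
static void unescape_u(const char *src, char *dst)
{
    while (*src) {
        if (src[0] == '\\' && src[1] == 'u' &&
            src[2] && src[3] && src[4] && src[5]) {
            char hex[5];
            memcpy(hex, src + 2, 4);          /* extract the "XXXX" part */
            hex[4] = '\0';
            unsigned int cp = (unsigned int)strtoul(hex, NULL, 16);
            dst += encode_utf8(cp, dst);      /* emit it as UTF-8 bytes */
            src += 6;                         /* skip the whole "\uXXXX" */
        } else {
            *dst++ = *src++;                  /* ordinary byte, copy as-is */
        }
    }
    *dst = '\0';
}

int main(void)
{
    const char *str = "abc\\u0122";
    char out[32];
    unescape_u(str, out);
    for (size_t i = 0; out[i] != 0; i++)
        printf("%02x ", (unsigned char)out[i]);  /* prints: 61 62 63 c4 a2 */
    printf("\n");
    return 0;
}

Running it prints 61 62 63 c4 a2, i.e. the same bytes the compiler produced for the literal Ģ in str1.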

Upvotes: 3
