user3758232

Reputation: 882

Numeric value of literal UTF-8 characters

I'm working on a string unescaping function that converts literal sequences like \uxxxx (where xxxx is a hex value) into bytes of the corresponding value. I am planning to have the function take the first two characters of the xxxx sequence, calculate the byte value, and do the same with the second pair of characters.

But I ran into an unexpected result with literal typed UTF-8 characters. The following illustrates my issue:

#include <stdio.h>

int main()
{
    unsigned char *str1 = "abcĢ";
    unsigned char *str2 = "abc\x01\x22";
    for (unsigned i = 0; i < 5; i++)
        printf ("String 1 character #%u: %x\n", i, str1[i]);
    for (unsigned i = 0; i < 5; i++)
        printf ("String 2 character #%u: %x\n", i, str2[i]);

    return 0;
}

Output:

String 1 character #0: 61
String 1 character #1: 62
String 1 character #2: 63
String 1 character #3: c4
String 1 character #4: a2
String 2 character #0: 61
String 2 character #1: 62
String 2 character #2: 63
String 2 character #3: 1
String 2 character #4: 22

Unicode character Ģ has the hex value 0x0122, so I would expect bytes #3 and #4 to be \x01 and \x22 respectively.

Where do c4 and a2 come from? I guess I am not understanding how multi-byte characters in strings are encoded in C. Any help would be appreciated.

Upvotes: 1

Views: 642

Answers (2)

Mark Ransom

Reputation: 308402

UTF-8 can't work by simply breaking a large value into its individual bytes, because the result would be ambiguous. How would you tell the difference between "\u4142" (䅂) and the two-character string "AB"?

The rules for producing a UTF-8 byte string from a Unicode code point are quite simple and eliminate that ambiguity: any sequence of byte values either decodes to an unambiguous sequence of code points or is invalid.

Here's a simple function that will convert a single Unicode codepoint value to a UTF-8 byte sequence.

void codepoint_to_UTF8(int codepoint, char * out)
/* out must point to a buffer of at least 5 chars. */
{
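    /* 1-byte sequence, U+0000..U+007F: plain ASCII, copied as-is. */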
    if (codepoint <= 0x7f)
        *out++ = (char)codepoint;
    else if (codepoint <= 0x7ff)
    {
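        /* 2-byte sequence, U+0080..U+07FF: 110xxxxx 10xxxxxx */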
        *out++ = (char)(0xc0 | ((codepoint >> 6) & 0x1f));
        *out++ = (char)(0x80 | (codepoint & 0x3f));
    }
    else if (codepoint <= 0xffff)
    {
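        /* 3-byte sequence, U+0800..U+FFFF: 1110xxxx 10xxxxxx 10xxxxxx */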
        *out++ = (char)(0xe0 | ((codepoint >> 12) & 0x0f));
        *out++ = (char)(0x80 | ((codepoint >> 6) & 0x3f));
        *out++ = (char)(0x80 | (codepoint & 0x3f));
    }
    else
    {
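        /* 4-byte sequence for anything above U+FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */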
        *out++ = (char)(0xf0 | ((codepoint >> 18) & 0x07));
        *out++ = (char)(0x80 | ((codepoint >> 12) & 0x3f));
        *out++ = (char)(0x80 | ((codepoint >> 6) & 0x3f));
        *out++ = (char)(0x80 | (codepoint & 0x3f));
    }
    *out = 0;
}

Note that this function does no error checking, so if you give it an input outside the valid Unicode range of 0 to 0x10FFFF it will happily generate a byte sequence that follows the UTF-8 bit patterns but does not correspond to any valid Unicode character.
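To tie this back to the question, here is a minimal usage sketch (my own addition, assuming the function above is compiled in the same file) that feeds it the code point of Ģ:

#include <stdio.h>

void codepoint_to_UTF8(int codepoint, char * out); /* the function above */

int main(void)
{
    char buf[5];
    codepoint_to_UTF8(0x122, buf);               /* Ģ is U+0122 */
    for (int i = 0; buf[i] != 0; i++)
        printf("%02x ", (unsigned char)buf[i]);  /* prints: c4 a2 */
    printf("\n");
    return 0;
}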

Upvotes: 1

Remy Lebeau

Reputation: 597166

Unicode character Ģ has the hex value 0x0122, so I would expect bytes #3 and #4 to be \x01 and \x22 respectively.

Where do c4 and a2 come from?

In Unicode, Ģ is codepoint U+0122 LATIN CAPITAL LETTER G WITH CEDILLA, which in UTF-8 is encoded as bytes 0xC4 0xA2.
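(You can see where those bytes come from by applying the UTF-8 rules by hand: 0x0122 is binary 1 0010 0010; the low six bits 10 0010 go into the trailing byte 10 100010 = 0xA2, and the remaining bits 00100 go into the leading byte 110 00100 = 0xC4.)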

Either your source file is saved as UTF-8, or your compiler is configured to encode string literals as UTF-8. Either way, the literal Ģ in your str1 string is stored as UTF-8 bytes. Thus:

unsigned char *str1 = "abcĢ";

is roughly equivalent to this:

unsigned char literal[] = {'a', 'b', 'c', 0xC4, 0xA2, '\0'};
unsigned char *str1 = &literal[0];

In an escape sequence, the entire sequence represents a single numeric value. So, \x01 and \x22 represent the individual numeric values 0x01 hex (1 decimal) and 0x22 hex (34 decimal), respectively. Thus:

unsigned char *str2 = "abc\x01\x22";

is roughly equivalent to this:

unsigned char literal[] = {'a', 'b', 'c', 0x01, 0x22, '\0'};
unsigned char *str2 = &literal[0];

You are simply outputting the raw bytes of the strings that str1 and str2 are pointing at.

The escape sequence \u0122 represents the numeric value 0x0122 hex (290 decimal), which in Unicode is codepoint U+0122, hence C4 A2 in UTF-8. So, if you have an input string like this:

const char *str = "abc\\u0122"; // {'a', 'b', 'c', '\', 'u', '0', '1', '2', '2', '\0'}

And you want to decode it to UTF-8, you would need to detect the "\u" prefix, extract the following "0122" substring, parse it as a hex number into an integer, interpret that integer as a Unicode codepoint, and convert it to UTF-8 (a, b, and c are already valid chars as-is in UTF-8).
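For illustration only, here is one way that decoding loop might look. The helper names unescape_u and encode_utf8 are made up for this sketch, error checking is minimal, and only code points up to U+FFFF are handled (matching the 4-hex-digit \u form):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: encode one code point (up to U+FFFF here) as UTF-8,
   returning the number of bytes written. Same branching as the function in
   the other answer. */
static size_t encode_utf8(unsigned int cp, char *out)
{
    if (cp <= 0x7f) { out[0] = (char)cp; return 1; }
    if (cp <= 0x7ff) {
        out[0] = (char)(0xc0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3f));
        return 2;
    }
    out[0] = (char)(0xe0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3f));
    out[2] = (char)(0x80 | (cp & 0x3f));
    return 3;
}

/* Hypothetical helper: expand \uXXXX escapes in src into dst.
   dst must be large enough (the output is never longer than the input). */
static void unescape_u(const char *src, char *dst)
{
    while (*src) {
        if (src[0] == '\\' && src[1] == 'u' &&
            src[2] && src[3] && src[4] && src[5]) {
            char hex[5];
            memcpy(hex, src + 2, 4);          /* extract the "XXXX" part */
            hex[4] = '\0';
            unsigned int cp = (unsigned int)strtoul(hex, NULL, 16);
            dst += encode_utf8(cp, dst);      /* emit it as UTF-8 bytes */
            src += 6;                         /* skip the whole "\uXXXX" */
        } else {
            *dst++ = *src++;                  /* ordinary byte, copy as-is */
        }
    }
    *dst = '\0';
}

int main(void)
{
    const char *str = "abc\\u0122";
    char out[32];
    unescape_u(str, out);
    for (size_t i = 0; out[i] != 0; i++)
        printf("%02x ", (unsigned char)out[i]);  /* prints: 61 62 63 c4 a2 */
    printf("\n");
    return 0;
}

Running it prints 61 62 63 c4 a2, i.e. the same bytes the compiler produced for the literal Ģ in str1.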

Upvotes: 3
