I want to get each character from a Unicode string. If this is a bad question, I hope for your understanding.
string str = "öp";
for (int i = 0; i < str.length(); i++) {
cout << str[i] << endl;
}
In this case, str[0] is a broken character, because ö is encoded with 2 bytes in UTF-8.
How can I handle this? I really appreciate your answers. Thank you.
Upvotes: 2
Views: 1657
Reputation: 109547
The "atomic" unit of a string
object evidently is another string
(containing a single code point) or an char32_t
(Unicode code point). The string
being the most usable as one can again compose it, and no UTF conversion is needed.
I am a bit rusty in C/C++, but something like:
string utf8_codepoint(const string& s, size_t i) {
// Skip continuation bytes (10xxxxxx) to reach the start of a sequence.
// Note the parentheses: == binds tighter than &.
while (i < s.length() && (s[i] & 0xC0) == 0x80) {
++i;
}
string cp(1, s[i]);
if ((s[i] & 0xC0) == 0xC0) { // Start byte (11xxxxxx).
++i;
while (i < s.length() && (s[i] & 0xC0) == 0x80) { // Continuation bytes.
cp += s[i];
++i;
}
}
return cp;
}
for (size_t i = 0; i < str.length(); i++)
cout << utf8_codepoint(str, i) << endl; // flawed: revisits multi-byte code points
A per-byte loop like the above prints some code points more than once, so advance the index by each code point's byte length instead (note cout rather than wcout: the string holds raw UTF-8 bytes, which belong on the narrow stream):
for (size_t i = 0; i < str.length(); ) {
string cp = utf8_codepoint(str, i);
i += cp.length();
cout << cp << endl;
}
Of course there are zero-width accents in Unicode that cannot be printed in isolation, but the same holds for control characters, or for not having a font with full Unicode support (and hence a font of some 35 MB in size).
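For example (a small sketch using the helper above; U+0308 COMBINING DIAERESIS encodes as the bytes 0xCC 0x88 in UTF-8):
string decomposed = "o\xCC\x88"; // "ö" in decomposed form: o + combining diaeresis
cout << utf8_codepoint(decomposed, 1) << endl; // prints the accent alone; rendering varies by terminal and font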
Upvotes: 0
Reputation: 238311
In order to insert characters (for example the newlines you attempt in your example) between characters of a UTF-8 string, you must only do so between complete grapheme clusters. Right now you add a newline after an incomplete code point, which breaks the encoding.
The Unicode standard is available from unicode.org. See this section in particular:
3.9 Unicode Encoding Forms
UTF-8
Table 3-6. UTF-8 Bit Distribution
+----------------------------+------------+-------------+------------+-------------+
| Scalar Value               | First Byte | Second Byte | Third Byte | Fourth Byte |
+----------------------------+------------+-------------+------------+-------------+
| 00000000 0xxxxxxx          | 0xxxxxxx   |             |            |             |
| 00000yyy yyxxxxxx          | 110yyyyy   | 10xxxxxx    |            |             |
| zzzzyyyy yyxxxxxx          | 1110zzzz   | 10yyyyyy    | 10xxxxxx   |             |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu   | 10uuzzzz    | 10yyyyyy   | 10xxxxxx    |
+----------------------------+------------+-------------+------------+-------------+
From these, we can devise the following algorithm to iterate code points:
for (size_t i = 0; i < str.length();) {
std::cout << str[i];
if(str[i] & 0x80) {          // lead byte of a multi-byte sequence
std::cout << str[i + 1];
if(str[i] & 0x20) {      // at least three bytes (1110zzzz or 11110uuu)
std::cout << str[i + 2];
if(str[i] & 0x10) {  // four bytes (11110uuu)
std::cout << str[i + 3];
i += 4;
} else {             // three bytes (1110zzzz)
i += 3;
}
} else {                 // two bytes (110yyyyy)
i += 2;
}
} else {                     // single byte (0xxxxxxx)
i += 1;
}
std::cout << std::endl;
}
This trivial algorithm is sufficient for your example if it is normalised in a composed form, i.e. "ö"
is a single code point. For general usage, however, a more complex algorithm is needed to distinguish grapheme clusters.
Furthermore, this trivial algorithm doesn't check for invalid sequences and may read past the end of the input string in such cases. It is only a simple example, not intended for production use. For production use, I would recommend an external library.
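A bounds-checked sketch of the same idea (the helper name utf8_sequence_length is my own, and overlong encodings are still not rejected):
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the byte length of the UTF-8 sequence
// starting at str[i], checking bounds and continuation bytes.
std::size_t utf8_sequence_length(const std::string& str, std::size_t i) {
    unsigned char lead = static_cast<unsigned char>(str[i]);
    std::size_t len;
    if (lead < 0x80)       len = 1;  // 0xxxxxxx
    else if (lead < 0xC0)  throw std::runtime_error("stray continuation byte");
    else if (lead < 0xE0)  len = 2;  // 110yyyyy
    else if (lead < 0xF0)  len = 3;  // 1110zzzz
    else if (lead < 0xF8)  len = 4;  // 11110uuu
    else                   throw std::runtime_error("invalid lead byte");
    if (i + len > str.length())
        throw std::runtime_error("truncated sequence");
    for (std::size_t k = 1; k < len; ++k)  // following bytes must be 10xxxxxx
        if ((static_cast<unsigned char>(str[i + k]) & 0xC0) != 0x80)
            throw std::runtime_error("missing continuation byte");
    return len;
}
The loop above then becomes for (std::size_t i = 0; i < str.length(); i += utf8_sequence_length(str, i)) with the printing done in the body.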
Upvotes: 2
Reputation: 148890
The problem is that UTF-8 (not Unicode) is a multi-byte character encoding. The most common characters (the ASCII set) use only a single byte, but less common ones (notably emoji) can use up to 4. But that is far from being the only problem.
If you only use characters from the Basic Multilingual Plane, and can be sure never to encounter combining ones, you can safely use std::wstring and wchar_t, because wchar_t is guaranteed to contain any character from the BMP.
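For example (assuming the compiler handles the source encoding of the literal correctly):
std::wstring wstr = L"öp";
// wstr.length() == 2: each BMP character occupies a single wchar_t,
// so wstr[0] is ö, wstr[1] is p, and per-character indexing works.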
But in the generic case, Unicode is a mess. Even when using char32_t, which can contain any Unicode code point, you cannot be sure to have a bijection between Unicode code points and graphemes (displayed characters). For example, LATIN SMALL LETTER E WITH ACUTE (é) is the Unicode character U+00E9. But it can also be represented in a decomposed form as U+0065 U+0301, that is LATIN SMALL LETTER E followed by a COMBINING ACUTE ACCENT. So even when using char32_t, you can get 2 code points for one single grapheme, and it would be incorrect to split them:
char32_t eaccute[] = { 'e', 0x301, 0 };
This is indeed a representation of é. You can copy and paste it to check that it is not the U+00E9 character but the decomposed form, yet in printed form there can be no difference.
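The same distinction in code (a small sketch using char32_t string literals):
std::u32string composed   = U"\u00E9";   // é as one code point
std::u32string decomposed = U"e\u0301";  // e + combining acute accent
// composed.length() == 1, decomposed.length() == 2: both render as the
// same grapheme, so splitting the decomposed form would detach the accent.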
TL/DR: Unless you are sure to use only a subset of the Unicode character set that could also be represented in a much smaller charset such as ISO-8859-1 (Latin-1) or equivalent, you have no simple way to know how to split a string into true characters.
Upvotes: 1