user2793162

Reputation:

Handle UTF-8 string

As far as I know, Linux uses UTF-8 encoding. This means I can use std::string to handle strings, right? The encoding will just be UTF-8.

Now, in UTF-8 some characters take 1 byte and others take 2, 3, or 4 bytes. My question is: how do you deal with UTF-8 encoded strings on Linux using C++?

In particular: how would you get the length of a string, in bytes or in number of characters? How would you traverse the string?

The reason I am asking is that, as I said, a UTF-8 character may be more than one byte. So myString[7] and myString[8] might not refer to two different characters. Also, the fact that a UTF-8 string is ten bytes long doesn't say much about its number of characters, right?

Upvotes: 5

Views: 4145

Answers (5)

Konrad Rudolph

Reputation: 545488

You cannot handle UTF-8 with std::string. Despite its name, std::string is only a container for bytes. It is not a type for text storage (beyond the fact that a byte buffer can obviously store any object, including text). It doesn’t even store characters: char is a byte, not a character, and a multi-byte character simply occupies several elements.
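To make the "container of bytes" point concrete, here is a minimal sketch that dumps what a std::string actually holds. The helper name hex_bytes is made up for illustration; it is not part of any library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: render each byte stored in the string as two
// lowercase hex digits, separated by spaces.
std::string hex_bytes(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // char may be signed, so go through unsigned char before shifting.
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

For the string "\xC3\xA9" (the UTF-8 encoding of U+00E9, written with escapes so the example does not depend on the source file encoding), size() is 2 and hex_bytes gives "c3 a9": one code point, two elements.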

You need to venture outside the standard library if you want to actually handle (rather than just store) Unicode characters. Traditionally, this is done by libraries such as ICU.

However, while this is a mature library, its C++ interface sucks. A more modern approach is taken in Ogonek. It’s not as well established and is still a work in progress, but it provides a much nicer interface.

Upvotes: 6

Simon Richter

Reputation: 29586

There are multiple concepts here:

  1. length of UTF-8 encoding in bytes
  2. number of Unicode code points used (= number of UTF-8 bytes outside the 0x80..0xbf range)
  3. number of glyphs ("characters" in Western languages)
  4. screen space occupied when displaying

Normally, you are only interested in 1. (for memory requirements) and 4. (for display), the others have no real application.

The amount of screen space can be queried from the rendering context. Note that this may change depending on context (for example, Arabic letters change shape at the beginning and end of words), so if you are doing text input, you may need to perform additional trickery to give users a consistent experience.

Upvotes: 2

Artem Agasiev

Reputation: 175

I'm using the libunistring library, which can help with all of these questions.

For example, here is a simple function that returns the length of a string in UTF-8 characters (code points):

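#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <uniname.h>  // UNINAME_INVALID
#include <unistr.h>   // u8_next, ucs4_t

// Counts the code points in a NUL-terminated UTF-8 string,
// stopping at the first broken character.
size_t my_utf8_strlen(const uint8_t *str) {
    if (str == NULL) return 0;
    if ((*str) == 0) return 0;

    size_t length = 0;
    const uint8_t *current = str;
    // Most recently decoded code point.
    ucs4_t ucs_c = UNINAME_INVALID;

    // u8_next returns NULL at the end of the string or on an invalid sequence.
    while (current && *current) {
        current = u8_next(&ucs_c, current);
        length++;

        // Broken character: do not count it.
        if (ucs_c == UNINAME_INVALID || ucs_c == 0xfffd)
            return length - 1;
    }

    return length;
}

// Use case
std::string test;

// Loading some text in `test` variable.
// ...

// std::string stores char, so the pointer has to be cast to uint8_t.
std::cout << my_utf8_strlen(reinterpret_cast<const uint8_t *>(test.c_str())) << std::endl;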

Upvotes: 1

john

Reputation: 87944

You may want to convert the UTF-8 encoded strings to some kind of fixed width encoding prior to manipulating them. But that depends on what you are trying to do.

The length in bytes of a UTF-8 string is just str.size(). Getting the length in characters (code points) is slightly more difficult, but you can do it by ignoring any byte in the string which has a value >= 0x80 and < 0xC0. In UTF-8 those values are always trailing bytes. So count the bytes in that range and subtract that count from the size of the string.

The above does ignore the issue of combining characters. It does rather depend on what your definition of character is.
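A minimal sketch of that counting rule follows. The function name count_code_points is my own, and the code assumes the input is already valid UTF-8; it counts the bytes that are not trailing bytes, which is the same as subtracting the trailing bytes from the size.

```cpp
#include <cstddef>
#include <string>

// Count code points by counting every byte that is NOT a trailing
// (continuation) byte. Trailing bytes look like 10xxxxxx, i.e. 0x80..0xBF.
// Assumption: `s` is valid UTF-8; malformed input gives a meaningless count.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For "e\xCC\x81" (the letter e followed by the combining acute accent U+0301) this returns 2 although size() is 3 and a reader sees a single character, which is the combining-character caveat in practice.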

Upvotes: 3

Kent Munthe Caspersen

Reputation: 6888

You can determine how many bytes a character occupies from the leading (most significant) bits of its first byte: UTF-8, Description
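As a rough sketch of how that can be used to traverse a string, the two functions below (both names are hypothetical) read the sequence length from the first byte and then split the string into one substring per code point. This only looks at the lead-byte pattern: it does not validate the trailing bytes and does not reject lead bytes that are never valid in UTF-8, such as 0xC0 or 0xC1.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sequence length implied by the first byte of a UTF-8 sequence.
// Returns 0 for a byte that cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // trailing byte or invalid lead
}

// Split a string into one substring per code point.
// Sketch only: trailing bytes are not checked.
std::vector<std::string> split_code_points(const std::string& s) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        if (len == 0) len = 1;                       // bad lead: step one byte
        if (len > s.size() - i) len = s.size() - i;  // truncated final sequence
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}
```

Splitting "a\xC3\xA9\xE2\x82\xAC" (a, é, €) this way yields three pieces of 1, 2 and 3 bytes. For real input, a library that validates the encoding is the safer choice.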

Upvotes: 0
