jmasterx

Reputation: 54123

Getting the actual length of a UTF-8 encoded std::string?

My std::string is UTF-8 encoded, so str.length() obviously returns the wrong result.

I found this information but I'm not sure how I can use it to do this:

The following byte sequences are used to represent a character. The sequence to be used depends on the UCS code number of the character:

   0x00000000 - 0x0000007F:
       0xxxxxxx

   0x00000080 - 0x000007FF:
       110xxxxx 10xxxxxx

   0x00000800 - 0x0000FFFF:
       1110xxxx 10xxxxxx 10xxxxxx

   0x00010000 - 0x001FFFFF:
       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

How can I find the actual length of a UTF-8 encoded std::string? Thanks

Upvotes: 44

Views: 43979

Answers (12)

PhotonFalcon

Reputation: 924

Most of my personal C library code has only really been tested with English text, but here is how I've implemented my UTF-8 string length function. I originally based it on the bit patterns described in the table on this wiki page. This isn't the most readable code, but my intent was to remove any branching from the loop. Sorry for posting C code when the question asks for C++; it should translate to std::string fairly easily with some slight modifications. The functions below are copied from my website if you're interested.

size_t utf8len(const char* const str) {
    size_t len = 0;
    for (; *str != 0; ++len) {
        // The top four bits of the lead byte determine the sequence length.
        int v0 = (*str & 0x80) >> 7;
        int v1 = (*str & 0x40) >> 6;
        int v2 = (*str & 0x20) >> 5;
        int v3 = (*str & 0x10) >> 4;
        // 0xxxxxxx -> +1, 110xxxxx -> +2, 1110xxxx -> +3, 11110xxx -> +4
        str += 1 + v0 * v1 + v0 * v1 * v2 + v0 * v1 * v2 * v3;
    }
    return len;
}
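
As a quick illustration of the std::string translation mentioned above, a minimal wrapper could simply delegate to the C function (a sketch only; the name utf8len_str is made up, and it stops at any embedded NUL byte since the C function is terminator-based):

#include <string>

// Sketch: count code points in a std::string by reusing utf8len() above.
// Note: stops at an embedded '\0' because utf8len() relies on the terminator.
size_t utf8len_str(const std::string& s) {
    return utf8len(s.c_str());
}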

Note that this does not validate any of the bytes (much like all the other suggested answers here). Personally I would separate string validation from my string length function, as that is not its responsibility. If we move validation into another function, it could look something like the following.

bool utf8valid(const char* const str) {
    if (str == NULL)
        return false;
    const char* c = str;
    bool valid = true;
    for (size_t i = 0; c[0] != 0 && valid;) {
        valid = (c[0] & 0x80) == 0
            || ((c[0] & 0xE0) == 0xC0 && (c[1] & 0xC0) == 0x80)
            || ((c[0] & 0xF0) == 0xE0 && (c[1] & 0xC0) == 0x80 && (c[2] & 0xC0) == 0x80)
            || ((c[0] & 0xF8) == 0xF0 && (c[1] & 0xC0) == 0x80 && (c[2] & 0xC0) == 0x80 && (c[3] & 0xC0) == 0x80);
        int v0 = (c[0] & 0x80) >> 7;
        int v1 = (c[0] & 0x40) >> 6;
        int v2 = (c[0] & 0x20) >> 5;
        int v3 = (c[0] & 0x10) >> 4;
        i += 1 + v0 * v1 + v0 * v1 * v2 + v0 * v1 * v2 * v3;
        c = str + i;
    }
    return valid;
}
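
The two functions could then be combined by the caller, for example (a hypothetical usage sketch, written as C++ even though the functions above are C):

#include <iostream>

int main() {
    const char* s = "caf\xc3\xa9"; // "café" encoded as UTF-8 (5 bytes, 4 code points)
    if (utf8valid(s))
        std::cout << utf8len(s) << '\n'; // should print 4
    else
        std::cout << "invalid UTF-8\n";
}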

If you are going for readability, I'll admit that the other suggestions are quite a bit more readable.

Update: thanks to Max Brauer (who left a comment) for simplifying the code a little. Here is what utf8len becomes with his simplification.

size_t utf8len(const char* str) {
    size_t len = 0;
    for (; *str != 0; ++len) {
        int v01 = ((*str & 0x80) >> 7) & ((*str & 0x40) >> 6);
        int v2 = (*str & 0x20) >> 5;
        int v3 = (*str & 0x10) >> 4;
        // Same step sizes as before: 1, 2, 3 or 4 bytes depending on the lead byte.
        str += 1 + ((v01 << v2) | (v01 & v3));
    }
    return len;
}

Upvotes: 3

phuclv

Reputation: 41834

Most (if not all) of the other answers only give the number of code points, and they fail completely for combining characters, emoji and more complex scripts. For example, here is the output from user2781185's solution after modifying it slightly for a demo on Godbolt:

Length (char-values):  5, length (code points):  4. String: café
Length (char-values):  6, length (code points):  5. String: café
Length (char-values): 15, length (code points):  5. String: 가각
Length (char-values): 24, length (code points):  8. String: ဂ︀င︀⋚︀丸︀
Length (char-values): 47, length (code points): 13. String: 🏳️‍🌈👨‍👩‍👦‍👦🇪🇺
Length (char-values): 74, length (code points): 21. String: 👨‍👩‍👦‍👦😶‍🌫️👩🏻‍❤️‍💋‍👩🏿
Length (char-values): 21, length (code points):  7. String: ফোল্ডার
Length (char-values): 18, length (code points):  8. String: dര്‍g1️⃣
Length (char-values): 18, length (code points):  6. String: Xല്‍🇺🇳

As you can see, the lengths returned are just the number of code points and bear no relation whatsoever to what users see ("user-perceived characters"). Even the two café strings are different.

To get the actual number of user-perceived characters (grapheme clusters), you have to use a proper library like Boost.Unicode/Boost.Text/Boost.Locale or the official ICU from the Unicode Consortium: normalize the string to a composed form like NFC or NFKC first, then count the grapheme clusters.

Here is sample code showing how to do that with ICU:

#include <unicode/unistr.h>
#include <unicode/normalizer2.h>
#include <unicode/brkiter.h>

#include <iostream>
#include <cassert>
#include <memory>

using namespace icu;

int main()
{
    // Assumes this source file is saved as UTF-8.
    UnicodeString str = UnicodeString::fromUTF8("नमस्ते café café 😶‍🌫️🏃🏻‍♀️");

    UErrorCode errorCode = U_ZERO_ERROR;
    const Normalizer2* nfkc = Normalizer2::getNFKCInstance(errorCode);
    assert(U_SUCCESS(errorCode));
    str = nfkc->normalize(str, errorCode); // ALWAYS NORMALIZE THE STRINGS FIRST
    assert(U_SUCCESS(errorCode));

    {
        UErrorCode err = U_ZERO_ERROR;
        std::unique_ptr<BreakIterator> iter(
            BreakIterator::createCharacterInstance(Locale::getDefault(), err));
        assert(U_SUCCESS(err));
        iter->setText(str);

        int count = 0;
        while (iter->next() != BreakIterator::DONE) ++count;
        std::cout << count << std::endl;
    }

    return 0;
}

Another, probably simpler, library for this purpose is yhirose/cpp-unicodelib:

std::u32string s = U"hello☺😆";
auto normalized = unicode::to_nfkc(s.c_str(), s.length());
std::cout << "Length: "
          << unicode::grapheme_count(normalized.c_str(), normalized.length())


Upvotes: 0

Gem Taylor

Reputation: 5613

A slightly lazy approach is to count only the lead bytes, but to visit every byte. This avoids the complexity of decoding the various lead-byte sizes, at the cost of visiting every byte, though there usually aren't that many of them (2x-3x):

#include <algorithm>
#include <cstddef>
#include <string>

size_t utf8Len(const std::string& s)
{
  return std::count_if(s.begin(), s.end(),
    [](char c) { return (static_cast<unsigned char>(c) & 0xC0) != 0x80; } );
}

Note that certain byte values are illegal as lead bytes, for example those that would represent values bigger than the 21 bits needed for extended Unicode, but the decoding approach would not know how to deal with such bytes either.
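
A rough sketch of a variant that also flags those impossible lead bytes (the function name is made up, and this is not full UTF-8 validation):

#include <string>

// Counts lead bytes, but returns -1 on byte values that can never start a valid
// sequence: 0xC0/0xC1 (overlong two-byte leads) and 0xF5..0xFF (beyond U+10FFFF).
long utf8LenChecked(const std::string& s)
{
    long count = 0;
    for (unsigned char c : s) {
        if (c == 0xC0 || c == 0xC1 || c >= 0xF5) return -1; // impossible lead byte
        if ((c & 0xC0) != 0x80) ++count;                    // not a continuation byte
    }
    return count;
}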

Upvotes: 1

user2781185

Reputation: 363

It is often claimed that "C++ knows nothing about encodings, so you can't expect to use a standard function to do this."

The standard library does, however, acknowledge the existence of character encodings, in the form of locales. If your system supports a locale, it is very easy to use the standard library to compute the length of a string. In the example code below I assume your system supports the locale en_US.utf8. If I compile the code and execute it as "./a.out ソニーSony", the output is that there were 13 char-values and 7 characters, all without any reference to the internal representation of UTF-8 character codes or the use of 3rd-party libraries.

#include <clocale>
#include <cstdlib>
#include <iostream>
#include <string>

using namespace std;

int main(int argc, char *argv[])
{
  string str(argv[1]);
  unsigned int strLen = str.length();
  cout << "Length (char-values): " << strLen << '\n';
  setlocale(LC_ALL, "en_US.utf8");
  unsigned int u = 0;
  const char *c_str = str.c_str();
  unsigned int charCount = 0;
  while(u < strLen)
  {
    int charLen = mblen(&c_str[u], strLen - u);
    if (charLen < 1) break; // stop on an invalid or incomplete multibyte sequence
    u += charLen;
    charCount += 1;
  }
  cout << "Length (characters): " << charCount << endl; 
}

Upvotes: 24

user4153980

Reputation:

Just another naive implementation to count the characters in a UTF-8 string:

#include <string>

int utf8_strlen(const std::string& str)
{
    int c,i,ix,q;
    for (q=0, i=0, ix=str.length(); i < ix; i++, q++)
    {
        c = (unsigned char) str[i];
        if      (c>=0   && c<=127) i+=0;
        else if ((c & 0xE0) == 0xC0) i+=1;
        else if ((c & 0xF0) == 0xE0) i+=2;
        else if ((c & 0xF8) == 0xF0) i+=3;
        //else if ((c & 0xFC) == 0xF8) i+=4; // 111110bb //byte 5, unnecessary in 4-byte UTF-8
        //else if ((c & 0xFE) == 0xFC) i+=5; // 1111110b //byte 6, unnecessary in 4-byte UTF-8
        else return 0; // invalid UTF-8
    }
    return q;
}

Upvotes: 0

twotrees

Reputation: 11

This code is ported from php-iconv to C++; you need iconv first. Hope it's useful:

// Ported from PHP:
// http://lxr.php.net/xref/PHP_5_4/ext/iconv/iconv.c#_php_iconv_strlen
#include <iconv.h>
#include <cstdint>
#include <string>

typedef std::uint32_t UInt32; // stand-in for the project's own UInt32 typedef

#define GENERIC_SUPERSET_NBYTES 4
#define GENERIC_SUPERSET_NAME   "UCS-4LE"

UInt32 iconvStrlen(const char *str, size_t nbytes, const char* encode)
{
    UInt32 retVal = (unsigned int)-1;

    unsigned int cnt = 0;

    iconv_t cd = iconv_open(GENERIC_SUPERSET_NAME, encode);
    if (cd == (iconv_t)(-1))
        return retVal;

    const char* in;
    size_t  inLeft;

    char *out;
    size_t outLeft = 0; // initialized in case the input is empty

    char buf[GENERIC_SUPERSET_NBYTES * 2] = {0};

    for (in = str, inLeft = nbytes, cnt = 0; inLeft > 0; cnt += 2) 
    {
        size_t prev_in_left;
        out = buf;
        outLeft = sizeof(buf);

        prev_in_left = inLeft;

        if (iconv(cd, (char **) &in, &inLeft, &out, &outLeft) == (size_t)-1) {
            if (prev_in_left == inLeft) {
                break;
            }
        }
    }
    iconv_close(cd);

    if (outLeft > 0)
        cnt -= outLeft / GENERIC_SUPERSET_NBYTES;

    retVal = cnt;
    return retVal;
}

UInt32 utf8StrLen(const std::string& src)
{
    return iconvStrlen(src.c_str(), src.length(), "UTF-8");
}
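
A small usage sketch (the expected result is an assumption; remember to link against iconv where it is a separate library):

#include <iostream>

int main()
{
    // "café" is 5 bytes in UTF-8 but 4 characters.
    std::cout << utf8StrLen("caf\xc3\xa9") << std::endl; // expected output: 4
}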

Upvotes: 0

Roger Pate

Reputation:

This is a naive implementation, but it should be helpful for you to see how this is done:

#include <cstddef>
#include <stdexcept>
#include <string>

std::size_t utf8_length(std::string const &s) {
  std::size_t len = 0;
  std::string::const_iterator begin = s.begin(), end = s.end();
  while (begin != end) {
    unsigned char c = *begin;
    int n;
    if      ((c & 0x80) == 0)    n = 1;
    else if ((c & 0xE0) == 0xC0) n = 2;
    else if ((c & 0xF0) == 0xE0) n = 3;
    else if ((c & 0xF8) == 0xF0) n = 4;
    else throw std::runtime_error("utf8_length: invalid UTF-8");

    if (end - begin < n) {
      throw std::runtime_error("utf8_length: string too short");
    }
    for (int i = 1; i < n; ++i) {
      if ((begin[i] & 0xC0) != 0x80) {
        throw std::runtime_error("utf8_length: expected continuation byte");
      }
    }
    ++len; // one character (code point) per decoded sequence
    begin += n;
  }
  return len;
}

Upvotes: 8

Charles Salvia

Reputation: 53299

You should probably take the advice of Omry and look into a specialized library for this. That said, if you just want to understand the algorithm to do this, I'll post it below.

Basically, you can convert your string into a wider-element format, such as wchar_t. Note that wchar_t has a few portability issues, because wchar_t is of varying size depending on your platform. On Windows, wchar_t is 2 bytes, and therefore ideal for representing UTF-16. But on UNIX/Linux, it's four bytes and is therefore used to represent UTF-32. Therefore, on Windows this will only work if you don't include any Unicode codepoints above 0xFFFF. On Linux you can include the entire range of codepoints in a wchar_t. (Fortunately, this issue will be mitigated with the C++0x Unicode character types.)

With that caveat noted, you can create a conversion function using the following algorithm:

template <class OutputIterator>
inline OutputIterator convert(const unsigned char* it, const unsigned char* end, OutputIterator out) 
{
    while (it != end) 
    {
        if (*it < 192) *out++ = *it++; // single byte character
        else if (*it < 224 && it + 1 < end && *(it+1) > 127) { 
            // double byte character
            *out++ = ((*it & 0x1F) << 6) | (*(it+1) & 0x3F);
            it += 2;
        }
        else if (*it < 240 && it + 2 < end && *(it+1) > 127 && *(it+2) > 127) { 
            // triple byte character
            *out++ = ((*it & 0x0F) << 12) | ((*(it+1) & 0x3F) << 6) | (*(it+2) & 0x3F);
            it += 3;
        }
        else if (*it < 248 && it + 3 < end && *(it+1) > 127 && *(it+2) > 127 && *(it+3) > 127) { 
            // 4-byte character
            *out++ = ((*it & 0x07) << 18) | ((*(it+1) & 0x3F) << 12) |
                ((*(it+2) & 0x3F) << 6) | (*(it+3) & 0x3F);
            it += 4;
        }
        else ++it; // Invalid byte sequence (throw an exception here if you want)
    }

    return out;
}

#include <iostream>
#include <iterator>
#include <string>

int main()
{
    using namespace std;
    std::string s = "\u00EAtre";
    cout << s.length() << endl;

    std::wstring output;
    convert(reinterpret_cast<const unsigned char*> (s.c_str()), 
        reinterpret_cast<const unsigned char*>(s.c_str()) + s.length(), std::back_inserter(output));

    cout << output.length() << endl; // Actual length
}

The algorithm isn't fully generic, because the input needs to be a pointer to unsigned char, so you can interpret each byte as having a value between 0 and 0xFF. The OutputIterator is generic (just so you can use an std::back_inserter and not worry about memory allocation), but its use as a generic parameter is limited: basically, it has to output to an array of elements large enough to represent a UTF-16 or UTF-32 character, such as wchar_t, uint32_t or the C++0x char32_t types. Also, I didn't include code to convert character byte sequences greater than 4 bytes, but you should get the point of how the algorithm works from what's posted.

Also, if you just want to count the number of characters, rather than output to a new wide-character buffer, you can modify the algorithm to include a counter rather than an OutputIterator. Or better yet, just use Marcelo Cantos' answer to count the first-bytes.
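
A rough sketch of that counting variant might look like the following (the function name is made up; it uses the same byte classification as convert() above and does no real validation):

#include <cstddef>

inline std::size_t count_code_points(const unsigned char* it, const unsigned char* end)
{
    std::size_t count = 0;
    while (it != end)
    {
        if (*it >= 248) { ++it; continue; }  // invalid lead byte, skip without counting
        std::ptrdiff_t n = 1;                // single byte (or stray continuation byte)
        if      (*it >= 240) n = 4;          // 4-byte character
        else if (*it >= 224) n = 3;          // triple byte character
        else if (*it >= 192) n = 2;          // double byte character
        if (end - it < n) n = end - it;      // truncated sequence at the end of the input
        it += n;
        ++count;
    }
    return count;
}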

Upvotes: 5

Lucas

Reputation: 6348

I recommend you use UTF8-CPP. It's a header-only library for working with UTF-8 in C++. With this library, it would look something like this:

#include <string>
#include "utf8.h" // header-only UTF8-CPP

int LengthOfUtf8String( const std::string &utf8_string ) 
{
    return utf8::distance( utf8_string.begin(), utf8_string.end() ); 
}

(Code is off the top of my head.)
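
For example, a quick usage sketch (assuming the function above and the UTF8-CPP header are available on the include path):

#include <iostream>
#include <string>

int main()
{
    std::string s = "caf\xc3\xa9";               // "café": 5 bytes in UTF-8
    std::cout << LengthOfUtf8String(s) << '\n';  // should print 4
}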

Upvotes: 4

Nemanja Trifunovic

Reputation: 24561

The UTF8-CPP library has a function that does just that. You can either include the library in your project (it is small) or just look at the function: http://utfcpp.sourceforge.net/

const char* twochars = "\xe6\x97\xa5\xd1\x88"; // "日ш": one 3-byte and one 2-byte sequence
size_t dist = utf8::distance(twochars, twochars + 5);
assert (dist == 2);

Upvotes: 0

Marcelo Cantos

Reputation: 185902

Count all first-bytes (the ones that don't match 10xxxxxx).

int len = 0;
while (*s) len += (*s++ & 0xc0) != 0x80;
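
For a std::string, the same loop might be wrapped like this (just a sketch; the function name is illustrative):

#include <string>

int utf8_codepoints(const std::string& str)
{
    int len = 0;
    const char* s = str.c_str();
    while (*s) len += (*s++ & 0xc0) != 0x80;
    return len;
}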

Upvotes: 75

Omry Yadan

Reputation: 33646

Try using an encoding library like iconv. It probably has the API you want.

An alternative is to implement your own utf8strlen, which determines the length of each code point and iterates over code points instead of char elements.

Upvotes: 1
