Chris

Reputation: 63

UTF-8, sprintf, strlen, etc

I'm trying to understand how to handle basic UTF-8 operations in C++.

Consider this scenario: a user inputs a name, which is limited to 10 letters (symbols in the user's language, not bytes), and the name is then stored.

It can be done this way in ASCII.

// ASCII
char * input; // user's input
char buf[11]; // 10 letters + terminating NUL
snprintf(buf, 11, "%s", input); buf[10] = 0;
int len = strlen(buf); // returns at most 10 (correct)

Now, how can this be done in UTF-8? Let's assume each character takes up to 4 bytes (as with Chinese).

// UTF-8
char * input; // user's input
char buf[41]; // 10 letters * 4 bytes + terminating NUL
snprintf(buf, 41, "%s", input); // ?? makes no sense: it limits by number of bytes, not letters
int len = strlen(buf); // returns the number of bytes, not letters (incorrect)

Can it be done with the standard sprintf/strlen? Are there replacements for those functions that work with UTF-8 (PHP had mb_-prefixed functions for this, IIRC)? If not, do I need to write them myself? Or should I approach the problem another way?

Note: I would prefer to avoid a wide-character solution...

EDIT: Let's limit it to Basic Multilingual Plane only.

Upvotes: 4

Views: 7414

Answers (4)

Serge Ballesta

Reputation: 149155

I would prefer to avoid a wide-character solution...

Wide characters are just not enough: if you need 4 bytes for a single glyph, that glyph is likely to be outside the Basic Multilingual Plane, and it cannot be represented by a single 16-bit wchar_t (assuming wchar_t is 16 bits wide, which is a common size, for example on Windows).
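To illustrate the point about 16-bit units, here is a minimal sketch that counts code points in a UTF-16 string. The function name utf16_code_points is hypothetical, and the code assumes well-formed UTF-16; it uses char16_t rather than wchar_t so the unit size does not depend on the platform.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-16 string (hypothetical helper).
// A code point outside the BMP is stored as a surrogate pair, i.e. two
// 16-bit units, so s.size() over-counts such characters.
// Assumption: the input is well-formed UTF-16.
std::size_t utf16_code_points(const std::u16string& s)
{
    std::size_t n = 0;
    for (char16_t c : s) {
        // Low (trailing) surrogates occupy 0xDC00..0xDFFF and never start
        // a code point, so they are skipped.
        if (c < 0xDC00 || c > 0xDFFF) ++n;
    }
    return n;
}
```

For a BMP-only string the result equals s.size(); for a character such as U+1F600 the string holds two units but only one code point.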

You will have to use a true Unicode library to convert the input to a list of Unicode characters in Normalization Form C (canonical composition) or its compatibility equivalent (NFKC)(*), depending on whether, for example, you want to count one or two characters for the ligature ﬀ (U+FB00). AFAIK, your best bet is ICU.


(*) Unicode allows multiple representations of the same glyph, notably the normal composed form (NFC) and the normal decomposed form (NFD). For example, the French é character can be represented in NFC as U+00E9 (LATIN SMALL LETTER E WITH ACUTE) or in NFD as U+0065 U+0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT, also displayed as é).

References and other examples on Unicode equivalence
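The effect of the two forms on a naive count can be sketched as follows. This is only an illustration, not a normalization routine: count_code_points, e_acute_nfc and e_acute_nfd are hypothetical names, the strings are written as raw UTF-8 bytes, and the input is assumed to be well-formed.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes
// (bytes of the form 10xxxxxx). Assumption: the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// The same visible character in two normalization forms, as raw bytes.
const std::string e_acute_nfc = "\xC3\xA9";   // U+00E9
const std::string e_acute_nfd = "e\xCC\x81";  // U+0065 U+0301
```

The NFC string is 2 bytes and 1 code point, while the NFD string is 3 bytes and 2 code points, so the two compare unequal byte-wise and give different lengths unless the input is normalized first.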

Upvotes: 1

Artemy Vysotsky

Reputation: 2734

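If you do not want to count UTF-8 characters yourself, you can temporarily convert the input to wide characters in order to cut the string. You do not need to store the intermediate values. Note that std::wstring_convert and the <codecvt> header are deprecated since C++17, and that with a 16-bit wchar_t this only handles characters in the Basic Multilingual Plane.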

#include <iostream>
#include <codecvt>
#include <string>
#include <locale>

std::string cutString(const std::string& in, size_t len)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt;
    auto wstring = cvt.from_bytes(in);
    if(len < wstring.length())
    {
        wstring = wstring.substr(0,len);
        return cvt.to_bytes(wstring);
    }    
    return in;
}
int main(){
    std::string test = "你好世界這是演示樣本";

    std::string res = cutString(test,5);
    std::cout << test << '\n' << res << '\n';

    return 0;
}

/****************
Output 
$ ./test
你好世界這是演示樣本
你好世界這
*/
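As an alternative to the conversion round trip, the cut can also be made directly on the UTF-8 bytes. The following is a minimal sketch under the assumption that the input is well-formed UTF-8; the name utf8_prefix is hypothetical. It counts code points, not user-perceived characters, so it may separate a base letter from a following combining mark.

```cpp
#include <cstddef>
#include <string>

// Returns the longest prefix of `in` that holds at most `max_chars`
// code points (hypothetical helper).
// Assumption: `in` is well-formed UTF-8.
std::string utf8_prefix(const std::string& in, std::size_t max_chars)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        // Any byte that is not a continuation byte (10xxxxxx) starts a
        // new code point.
        if ((c & 0xC0) != 0x80) {
            if (seen == max_chars) return in.substr(0, i);  // cut before it
            ++seen;
        }
    }
    return in;  // already short enough
}
```

Calling utf8_prefix(test, 5) on the string from the example above would keep the first five characters, provided the source file and the terminal both use UTF-8.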

Upvotes: 1

Mr.C64

Reputation: 43034

strlen only counts the bytes in the input string, until the terminating NUL.

On the other hand, you seem interested in the glyph count (what you called "symbols in user's language").

The process is complicated by UTF-8 being a variable-length encoding (as is UTF-16, to a lesser extent), so code points can be encoded using one to four bytes. There are also Unicode combining characters to consider.

To my knowledge, there's nothing like that in the standard C++ library. However, you may have better luck using third-party libraries like ICU.
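For a sense of what such a library has to do, here is a rough sketch that decodes the one-to-four-byte sequences and then ignores combining marks when counting. Both function names are hypothetical, no validation is performed, and only the Combining Diacritical Marks block (U+0300..U+036F) is treated as combining, which is a deliberate simplification. Proper grapheme segmentation needs the full Unicode rules, which is where a library like ICU comes in.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into code points (hypothetical helper).
// Assumption: the input is well-formed; malformed input is not detected.
std::u32string decode_utf8(const std::string& s)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = lead;                                             // 0xxxxxxx
        if      ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }  // 11110xxx
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Approximate "letter" count: code points that are not in the
// Combining Diacritical Marks block. This is a simplification, not
// real grapheme cluster segmentation.
std::size_t approx_letter_count(const std::string& s)
{
    std::size_t n = 0;
    for (char32_t cp : decode_utf8(s)) {
        if (cp < 0x0300 || cp > 0x036F) ++n;
    }
    return n;
}
```

With this sketch, both the composed and the decomposed spelling of é count as one letter, even though they differ in bytes and in code points.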

Upvotes: 1

YSC

Reputation: 40150

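std::strlen indeed counts bytes, so it is only correct for single-byte characters. To compute the length of a NUL-terminated wide-character string, one can use std::wcslen instead; it counts wchar_t elements, which matches the number of characters only when each character fits in a single wchar_t.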

Example:

#include <iostream>
#include <cwchar>
#include <clocale>
#include <locale>   // std::locale, used by imbue below

int main()
{
    const wchar_t* str = L"爆ぜろリアル!弾けろシナプス!パニッシュメントディス、ワールド!";

    std::setlocale(LC_ALL, "en_US.utf8");
    std::wcout.imbue(std::locale("en_US.utf8"));
    std::wcout << "The length of \"" << str << "\" is " << std::wcslen(str) << '\n';
}

Upvotes: 0
