Reputation: 73
I currently have to deal with Unicode in C++11 on Linux, with UTF-8 as the default encoding. Tasks that I need:
What library should I use to achieve the best result?
Thank you very much. Looking forward to hearing from you soon.
Upvotes: 0
Views: 1696
Reputation: 15154
For the regex/replace/search functions, I’ve previously used PCRE. This is designed to work with UTF-8 strings. You might be able to work with STL regular expressions, but not in any portable way. (Windows, in particular, does not support UTF-8 locales.)
Iterating through a UTF-8 string is even more complicated than you describe if you need to support combining marks or the zero-width joiner! You write that é is one character, but it might be two Unicode codepoints: Latin small letter e + combining acute accent. If you simply want to iterate through codepoints, you might use mbtowc() or std::codecvt::in() (the public wrapper around the virtual do_in) from the Standard Library. If you need to iterate through graphemes, the most portable way to do that is with ICU.
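To make the codepoint point concrete, here is a minimal sketch that decodes UTF-8 by hand instead of calling mbtowc(), so it does not depend on the current locale. The helper name decode_utf8 is my own, not a library function, and the sketch assumes mostly well-formed input: it replaces obviously broken sequences with U+FFFD but does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into Unicode codepoints.
// Truncated or malformed sequences become U+FFFD; overlong forms and
// surrogates are NOT rejected, so this is not a full validator.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80) {
            cp = b;
        } else if ((b & 0xE0) == 0xC0) {
            cp = b & 0x1F; extra = 1;
        } else if ((b & 0xF0) == 0xE0) {
            cp = b & 0x0F; extra = 2;
        } else if ((b & 0xF8) == 0xF0) {
            cp = b & 0x07; extra = 3;
        } else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
                ok = false;  // sequence ended early
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : static_cast<char32_t>(0xFFFD));
    }
    return out;
}
```

With this, the precomposed é ("\xC3\xA9") decodes to the single codepoint U+00E9, while "e\xCC\x81" decodes to two codepoints, U+0065 and U+0301, even though both render as the same character. Grouping those two into one grapheme is the part that needs ICU.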
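Regular string concatenation should work, and the standard library has mblen() for length: it returns the number of bytes in the next multibyte character, so you can step through the string with it to count characters. This isn’t completely portable, because the locale's multibyte encoding does not have to be UTF-8 (although there is a standard set of conversion functions).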
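As a rough illustration of both points, the sketch below concatenates UTF-8 strings with plain std::string and counts codepoints without mblen(). Instead, it counts every byte that is not a UTF-8 continuation byte (10xxxxxx), which sidesteps the locale dependence but assumes the input is valid UTF-8. The function name utf8_codepoint_count is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count codepoints in a string assumed to be valid UTF-8.
// Every codepoint starts with exactly one byte that is not of the form
// 10xxxxxx, so counting those bytes gives the number of codepoints.
std::size_t utf8_codepoint_count(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For example, std::string("caf") + "\xC3\xA9" is a valid UTF-8 string of 5 bytes, and utf8_codepoint_count reports 4 for it. Note that this counts codepoints, not graphemes, so a decomposed é still counts as two.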
Upvotes: 2