Reputation: 6027

Converting UTF-8 Characters to Upper/Lower case C++

I have a string that contains UTF-8 Characters, and I have a method that is supposed to convert every character to either upper or lower case, this is easily done with characters that overlap with ASCII, and obviously some characters cannot be converted, e.g. any Chinese character. However is there a good way to detect and convert other characters that can be Upper/Lower, e.g. all the greek characters? Also please note that I need to be able to do this on both Windows and Linux.

Thank you,

Upvotes: 6

Answers (3)

Davislor

Reputation: 15134

On Linux, or with a standard library that supports it, you would obtain a std::locale object for the appropriate locale, as uppercase conversion is locale-specific. Convert each UTF-8 character to a wchar_t, then call std::toupper() on it, then convert back to UTF-8. Note that the resulting string might be longer or shorter, and some ligatures might not work properly: ß to Ss in German is the example everyone keeps bringing up.

On Windows, this approach will work even less of the time, because wide characters are UTF-16 and not a fixed-width encoding (which violates the C++ language standard, but then maybe the standards committee shouldn't have tried to bluff Microsoft into breaking the Windows API). There is a ToUpper method in the CLR.

It is probably easier to use a portable library such as ICU.

Also make sure whether what you want is uppercase (capitalizing every letter) or titlecase (capitalizing the first letter of a string, or the first part of a ligature).

Upvotes: 0

tidwall

Reputation: 6949

Assuming that you have access to wctype.h, then convert your text to a 2-byte unicode string and use towupper(). Then convert it back to UTF-8.

Upvotes: 2

Alexandre C.

Reputation: 56956

Have a look at ICU.

Note that lower case to upper case functions are locale-dependant. Think about the turkish (ascii) letter I which gets "dotless lowercase i" and (ascii) i which gets "uppercase I with a dot".

Upvotes: 16

Converting UTF-8 Characters to Upper/Lower case C++

Answers (3)

Related Questions