segment_fault

Reputation: 11

Determine the character set of a string

By default, std::string on my machine holds GBK, and the string literals I write in my program are GBK-encoded. However, I sometimes receive data from a server that is encoded in UTF-8. I want to determine which character set a given string is using. I have looked at how UTF-8 and GBK encode characters, but it seems hard to implement the detection myself.

Upvotes: 0

Views: 958

Answers (1)

Remy Lebeau

Reputation: 598134

To check if a std::string contains UTF-8 content, decode it as UTF-8 and see if it fails.

To check if a std::string contains GBK, decode it as GBK and see if it fails.

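There are plenty of conversion libraries available, such as iconv and ICU, which are preinstalled on many platforms. Or use platform-specific APIs, like MultiByteToWideChar() on Windows (GBK is code page 936, its successor GB18030 is code page 54936, and UTF-8 is code page 65001). With MultiByteToWideChar(), pass the MB_ERR_INVALID_CHARS flag so that invalid input makes the call fail instead of being silently replaced.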

Or just write your own decoder (UTF-8 only takes a few dozen lines of code). You can find details about the bit layouts of UTF-8 and GBK on Wikipedia.
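As a rough illustration of the "write your own" route, here is a minimal sketch of a structural check for both encodings. The names is_valid_utf8, is_plausible_gbk, Encoding and guess_encoding are hypothetical, made up for this example. The GBK check assumes the usual layout (lead byte 0x81-0xFE, trail byte 0x40-0xFE except 0x7F) and does not verify that each byte pair is actually assigned to a character, so treat it as a heuristic rather than a full decoder.

```cpp
#include <cstddef>
#include <string>

// True if s is well-formed UTF-8. Rejects overlong forms, surrogates
// (U+D800..U+DFFF), code points above U+10FFFF, and truncated sequences.
bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }           // ASCII
        std::size_t len = 0;
        unsigned long cp = 0;
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                         // stray continuation or invalid lead
        if (n - i < len) return false;             // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < min_cp[len]) return false;        // overlong encoding
        if (cp > 0x10FFFF) return false;           // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        i += len;
    }
    return true;
}

// True if s fits the GBK byte layout (assumption: structural check only,
// unassigned pairs are not detected).
bool is_plausible_gbk(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { ++i; continue; }           // ASCII
        if (c == 0x80 || c == 0xFF) return false;  // not a valid lead byte
        if (i + 1 >= n) return false;              // lead byte without trail byte
        const unsigned char t = static_cast<unsigned char>(s[i + 1]);
        if (t < 0x40 || t == 0x7F || t == 0xFF) return false;
        i += 2;
    }
    return true;
}

enum class Encoding { Ascii, Utf8, Gbk, Unknown };

// UTF-8 is tested first: it is much stricter, and many UTF-8 strings
// also happen to fit the GBK byte layout.
Encoding guess_encoding(const std::string& s) {
    bool ascii = true;
    for (char ch : s) {
        if (static_cast<unsigned char>(ch) >= 0x80) { ascii = false; break; }
    }
    if (ascii) return Encoding::Ascii;             // identical in both encodings
    if (is_valid_utf8(s)) return Encoding::Utf8;
    if (is_plausible_gbk(s)) return Encoding::Gbk;
    return Encoding::Unknown;
}
```

You would call guess_encoding() on each buffer received from the server and convert only when it returns Encoding::Utf8. The order of the checks matters: random GBK text rarely passes the UTF-8 check, while the reverse is common, so a short GBK string that also happens to be valid UTF-8 will still be reported as UTF-8.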

Upvotes: 1
