Reputation: 1537

Decode utf8 entities from json into utf8 C++

I have a string with utf8 entities (I'm not sure I named it right):

std::string std = "\u0418\u043d\u0434\u0435\u043a\u0441";

How can I convert it into something more readable? I use g++ with C++11 support, but after a couple of hours digging through the std::codecvt documentation I have gotten nowhere:

std::string std = "\u0418\u043d\u0434\u0435\u043a\u0441";

wstring_convert<codecvt_utf8_utf16<char16_t>,char16_t> convert; 
string dest = convert.to_bytes(std);

This produces a nightmarish wall of compiler errors, starting with:

error: no matching function for call to ‘std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>::to_bytes(std::string&)’

I hope there is another way.

Upvotes: 2

Answers (2)

Remy Lebeau

Reputation: 596527

First off, your use of std::wstring_convert is backwards. You have a UTF-8 encoded std::string that you want to convert to a wide Unicode string. You are getting the compiler error because to_bytes() does not take a std::string as input. It takes a std::wstring_convert::wide_string as input (which is std::u16string in your case, due to your use of char16_t in the specialization), so you need to use from_bytes() instead of to_bytes():

std::string std = "\u0418\u043d\u0434\u0435\u043a\u0441";

std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
std::u16string dest = convert.from_bytes(std);
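For completeness, the corrected call can be wrapped into a small helper with the headers it needs. This is only a minimal sketch: the function name utf8_to_utf16 is made up for illustration, and note that <codecvt> and std::wstring_convert are deprecated since C++17, so this mainly suits the C++11 setup described in the question.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper (the name is not from the original answer).
// Decodes a UTF-8 byte string into UTF-16 code units.
// from_bytes() throws std::range_error if the input is not valid UTF-8.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(utf8);
}
```

Called with the 12-byte UTF-8 encoding of "Индекс", it should hand back six char16_t code units, one per Cyrillic letter.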

Now, that being said, Section 9 of the JSON specification states:

9 String

A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F. There are two-character escape sequence representations of some characters.

\" represents the quotation mark character (U+0022).

\\ represents the reverse solidus character (U+005C).

\/ represents the solidus character (U+002F).

\b represents the backspace character (U+0008).

\f represents the form feed character (U+000C).

\n represents the line feed character (U+000A).

\r represents the carriage return character (U+000D).

\t represents the character tabulation character (U+0009).

So, for example, a string containing only a single reverse solidus character may be represented as "\\".

Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. Hexadecimal digits can be digits (U+0030 through U+0039) or the hexadecimal letters A through F in uppercase (U+0041 through U+0046) or lowercase (U+0061 through U+0066). So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

The following four cases all produce the same result:

"\u002F"

"\u002f"

"\/"

"/"

To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

The raw JSON data itself may be encoded in UTF-8 (the most common encoding), UTF-16, etc. But regardless of the encoding used, the character sequence "\u0418\u043d\u0434\u0435\u043a\u0441" represents the UTF-16 code unit sequence 0x0418 0x043D 0x0434 0x0435 0x043A 0x0441, which is the Unicode character string "Индекс".
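The surrogate pair rule from the quoted section boils down to a little arithmetic. As a rough sketch (combine_surrogates is a hypothetical name, and the function assumes its arguments really are a valid high/low pair), two \uXXXX values can be merged into one code point like this:

```cpp
#include <cstdint>

// Hypothetical helper: merges a UTF-16 surrogate pair into one code point.
// Assumes hi is in [0xD800, 0xDBFF] and lo is in [0xDC00, 0xDFFF];
// no validation is performed here.
constexpr std::uint32_t combine_surrogates(std::uint16_t hi, std::uint16_t lo) {
    return 0x10000u
         + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00u);
}
```

For the G clef example, combine_surrogates(0xD834, 0xDD1E) works out to 0x10000 + 0xD000 + 0x11E = 0x1D11E.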

If you use an actual JSON parser (such as JSON for Modern C++, jsoncpp, RapidJSON, etc.), it will parse the UTF-16 code unit values for you and return readable Unicode strings.

But, if you are processing the JSON data manually, then you will have to manually decode the two-character escapes and any \uXXXX escape sequences yourself. std::wstring_convert cannot do that for you. It can only convert the JSON from std::string to std::wstring/std::u16string, if that makes it easier for you to parse the data. However, you still have to parse the content of the JSON separately.
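If you do go the manual route, decoding could look roughly like the sketch below. It takes the body of a JSON string (without the surrounding quotes) and emits UTF-8 directly, so no wide string is needed. All names here (append_utf8, read_hex4, decode_json_escapes) are made up for illustration, and this is not a full JSON parser: it only handles the escapes listed in the spec excerpt above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Append one Unicode code point to `out`, encoded as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Read exactly four hexadecimal digits starting at s[pos].
std::uint32_t read_hex4(const std::string& s, std::size_t pos) {
    if (pos + 4 > s.size()) throw std::runtime_error("truncated \\u escape");
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const char c = s[pos + i];
        value <<= 4;
        if (c >= '0' && c <= '9')      value |= static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') value |= static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') value |= static_cast<std::uint32_t>(c - 'A' + 10);
        else throw std::runtime_error("bad hex digit in \\u escape");
    }
    return value;
}

// Decode the escapes inside a JSON string body and return UTF-8.
std::string decode_json_escapes(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\') { out += in[i]; continue; }
        if (++i >= in.size()) throw std::runtime_error("dangling backslash");
        switch (in[i]) {
            case '"':  out += '"';  break;
            case '\\': out += '\\'; break;
            case '/':  out += '/';  break;
            case 'b':  out += '\b'; break;
            case 'f':  out += '\f'; break;
            case 'n':  out += '\n'; break;
            case 'r':  out += '\r'; break;
            case 't':  out += '\t'; break;
            case 'u': {
                std::uint32_t cp = read_hex4(in, i + 1);
                i += 4;  // i now sits on the last hex digit
                if (cp >= 0xD800 && cp <= 0xDBFF) {
                    // High surrogate: a \uDC00..\uDFFF escape must follow.
                    if (i + 2 >= in.size() || in[i + 1] != '\\' || in[i + 2] != 'u')
                        throw std::runtime_error("missing low surrogate");
                    const std::uint32_t lo = read_hex4(in, i + 3);
                    if (lo < 0xDC00 || lo > 0xDFFF)
                        throw std::runtime_error("invalid low surrogate");
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                    i += 6;  // skip the second \uXXXX
                } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
                    throw std::runtime_error("unexpected low surrogate");
                }
                append_utf8(out, cp);
                break;
            }
            default:
                throw std::runtime_error("unknown escape sequence");
        }
    }
    return out;
}
```

Note that in C++ source the test input has to be written with doubled backslashes, e.g. decode_json_escapes("\\u0418\\u043d\\u0434\\u0435\\u043a\\u0441"), so that the function actually sees the backslash-u text that a JSON file would contain. With a single backslash the compiler would already have replaced each \uXXXX with the character itself.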

Afterwards, if so desired, you can use std::wstring_convert to convert any extracted std::wstring/std::u16string strings back to UTF-8 to save memory.
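The reverse direction uses to_bytes() on the same converter type. A minimal sketch, again with a hypothetical function name (utf16_to_utf8) and the same caveat that std::wstring_convert is deprecated since C++17:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper (the name is not from the original answer).
// Encodes UTF-16 code units as a UTF-8 byte string.
// to_bytes() throws std::range_error on invalid input, e.g. a lone surrogate.
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(utf16);
}
```

Surrogate pairs in the input are combined, so a character outside the Basic Multilingual Plane comes out as a single four-byte UTF-8 sequence.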

Upvotes: 4

Mike Lischke

Reputation: 53357

What you see are not entities but code points. You are defining characters via Unicode escape sequences and the compiler automatically converts them to UTF-8. A typical way to convert that to UTF-16 and vice versa is this:

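#include <codecvt>
#include <locale>
#include <string>

// Shared converter between UTF-8 (std::string) and UTF-16 stored in std::wstring.
static std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;

// Wide (UTF-16) to narrow (UTF-8).
std::string ws2s(const std::wstring &wstr) {
  return converter.to_bytes(wstr);
}

// Narrow (UTF-8) to wide (UTF-16).
std::wstring s2ws(const std::string &str) {
  return converter.from_bytes(str);
}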

Of course you cannot convert the original string into another string of the same type (std::string) as it cannot hold such characters. This is why the UTF-16 code was converted to UTF-8 by your compiler in the first place.

Upvotes: 0
