Reputation: 5533
It is well known that the C++11 standard library makes it easy to convert a string from UTF-8 to UTF-16. However, the following code successfully converts invalid UTF-8 input (at least under MSVC2010):
#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

int main() {
    std::string input = "\xEA\x8E\x97" "\xE0\xA8\x81" "\xED\xAE\x8D";
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    try {
        std::u16string output = converter.from_bytes(input.data());
        printf("Converted successfully\n");
    }
    catch (const std::exception& e) {
        printf("Error: %s\n", e.what());
    }
}
The string here contains 9 bytes encoding 3 code points. The last byte sequence, 0xED 0xAE 0x8D, decodes to 0xDB8D, which is invalid because it falls into the surrogate range (U+D800..U+DFFF).
Is it possible to check a UTF-8 string for strict validity using only the standard library of modern C++? By that I mean rejecting every one of the invalid cases described in the Wikipedia article.
Upvotes: 7
Views: 3408
Reputation: 11
The valid byte sequences are defined in the official UTF-8 specification, RFC 3629: https://www.ietf.org/rfc/rfc3629.txt
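For illustration, here is a minimal validator sketch using only the standard library, transcribed from the byte-range table in section 4 of that RFC (the name is_valid_utf8 is my own, not anything from the standard):

#include <cstddef>
#include <cstdio>
#include <string>

// Strict UTF-8 validation per RFC 3629: rejects overlong forms,
// surrogate code points (U+D800..U+DFFF), and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const unsigned char* end = p + s.size();
    while (p < end) {
        const unsigned char b = *p;
        std::size_t len;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
        if (b <= 0x7F) { ++p; continue; }                         // ASCII
        else if (0xC2 <= b && b <= 0xDF) { len = 2; }
        else if (b == 0xE0)              { len = 3; lo = 0xA0; }  // no overlongs
        else if (0xE1 <= b && b <= 0xEC) { len = 3; }
        else if (b == 0xED)              { len = 3; hi = 0x9F; }  // no surrogates
        else if (0xEE <= b && b <= 0xEF) { len = 3; }
        else if (b == 0xF0)              { len = 4; lo = 0x90; }  // no overlongs
        else if (0xF1 <= b && b <= 0xF3) { len = 4; }
        else if (b == 0xF4)              { len = 4; hi = 0x8F; }  // not above U+10FFFF
        else return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        if (end - p < static_cast<std::ptrdiff_t>(len)) return false;  // truncated
        if (p[1] < lo || p[1] > hi) return false;      // restricted second byte
        for (std::size_t i = 2; i < len; ++i)          // plain continuation bytes
            if (p[i] < 0x80 || p[i] > 0xBF) return false;
        p += len;
    }
    return true;
}

int main() {
    std::string input = "\xEA\x8E\x97" "\xE0\xA8\x81" "\xED\xAE\x8D";
    printf("%s\n", is_valid_utf8(input) ? "valid" : "invalid");  // prints "invalid"
}

Run on the question's input, this prints "invalid": the lead byte 0xED only allows 0x80..0x9F as its second byte, and anything higher would encode a surrogate.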
Upvotes: 0