stgatilov

Reputation: 5533

Check if UTF-8 string is valid in modern C++

It is well known that the C++11 standard library makes it easy to convert a string from UTF-8 to UTF-16. However, the following code converts invalid UTF-8 input without reporting any error (at least under MSVC2010):

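#include <codecvt>
#include <cstdio>     // printf
#include <exception>
#include <locale>
#include <string>

int main() {
    // Three 3-byte sequences; the last one encodes a surrogate code point.
    std::string input = "\xEA\x8E\x97" "\xE0\xA8\x81" "\xED\xAE\x8D";
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    try {
        // from_bytes is expected to throw std::range_error on a failed conversion.
        std::u16string output = converter.from_bytes(input);
        printf("Converted successfully\n");
    }
    catch (const std::exception &e) {
        printf("Error: %s\n", e.what());
    }
}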

The string here contains 9 bytes that form 3 code points. The last code point is 0xDB8D, which is invalid because it falls in the surrogate range (0xD800 to 0xDFFF).
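The value of each code point can be checked with a few bit operations. The sketch below is only an illustration: decode3 is a hypothetical helper name, and it decodes a single 3-byte sequence without validating it in any way.

```cpp
#include <cstdint>

// Hypothetical helper (not a standard function): decodes one 3-byte UTF-8
// sequence of the form 1110xxxx 10xxxxxx 10xxxxxx. No validation is done.
std::uint32_t decode3(unsigned char b0, unsigned char b1, unsigned char b2) {
    return (static_cast<std::uint32_t>(b0 & 0x0F) << 12)   // 4 payload bits
         | (static_cast<std::uint32_t>(b1 & 0x3F) << 6)    // 6 payload bits
         |  static_cast<std::uint32_t>(b2 & 0x3F);         // 6 payload bits
}
```

For the bytes ED AE 8D this gives 0xD000 | 0x0B80 | 0x000D = 0xDB8D, which lies between 0xD800 and 0xDFFF.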

Is it possible to check a UTF-8 string for strict validity using only the standard library of modern C++? By strict I mean that every invalid case described in the Wikipedia article on UTF-8 must be rejected.

Upvotes: 7

Views: 3408

Answers (1)

Mechi

Reputation: 11

According to the official UTF-8 specification, RFC 3629 (https://www.ietf.org/rfc/rfc3629.txt):

  • The first octet of a multi-octet sequence indicates the number of octets in the sequence, so you can check that each sequence has the correct length.
  • The octet values C0, C1, and F5 to FF never appear, so you can check that none of them occurs in the UTF-8 string.
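These two checks can be combined into a hand-written validator that needs nothing beyond <string>. The following is a minimal sketch, not a tested library routine; is_valid_utf8 is a hypothetical name. Note that the two rules above are not sufficient on their own: RFC 3629 also forbids overlong encodings, the surrogate range D800 to DFFF, and code points above 10FFFF, so the sketch decodes each sequence and checks those cases as well.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the standard library): returns true if
// `s` is well-formed UTF-8 in the sense of RFC 3629.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; continue; }        // single-octet (ASCII)

        std::size_t len = 0;       // octets in this sequence
        unsigned long cp = 0;      // decoded code point
        unsigned long min_cp = 0;  // smallest value allowed for this length
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation octet, C0, C1, or F5..FF

        if (n - i < len) return false;           // sequence is truncated

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation octet
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min_cp) return false;                  // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        i += len;
    }
    return true;
}
```

Applied to the string from the question, is_valid_utf8 returns false because the last sequence, ED AE 8D, decodes to the surrogate 0xDB8D. For the first six bytes alone it returns true.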

Upvotes: 0
