Transform byte array to string while supporting different encodings

Question

Let's say I have read the binary content of a text file into a std::vector and I want to transform these bytes into a string representation.

As long as the bytes are encoded using a single-byte encoding (ASCII for example), a transformation to std::string is pretty straightforward:

std::string transformToString(std::vector bytes)
{
  std::string str;
  
  str.assign(
    reinterpret_cast(const_cast(bytes.data())),
    data.size() / sizeof(std::string::value_type)
  );

  return str;
}

As soon as the bytes are encoded in some unicode format, things get a little bit more complicated.

As far as I know, C++ supports additional string types for unicode strings. These are std::u8string for UTF-8, std::u16string for UTF-16 and std::u32string for UTF-32.

Problem 1: Let's say the bytes are encoded in UTF-8. How can I create a std::u8string from these bytes in the first place? Also, how do I know the length of the string since there can be code points encoded in multiple bytes?

Problem 2: I've seen, that UTF-16 and UTF-32 support both big-endian and little-endian byte order. Let's say the bytes are encoded in UTF-16 BE or UTF-16 LE. How can I create a std::u16string from the bytes (and how can I specify the byte order for transformation)? I am looking for something like std::u16string u16str = std::u16string::from_bytes(bytes, byte_order::big_endian);.

Problem 3: Are the previously listed types of unicode string already aware of a byte order mark or does the byte order mark (if present) need to be processed separately? Since the said string types are just char8_t, char16_t and char32_t templated on a std::basic_string, I assume, that processing of a byte order mark is not supported.

Clarification: Please note, that I do not want to do any conversions. Almost every article I found was about how to convert UTF-8 strings to other encodings and vice-versa. I just want to get the string representation of the specified byte array. Therefore, as the user/programmer, I must be aware of the encoding of the bytes to get the correct representation. For example:

The bytes are encoded in UTF-8 (e.g. 41 42 43 (ABC)). I try to transform them to a std::u8string. The transformation was correct, the string is ABC.
The bytes are encoded in UTF-8 (e.g. 41 42 43 (ABC)). I try to transform them to a std::u16string. The transformation fails or the resulting string is not correct.

Transform byte array to string while supporting different encodings

Answers (1)

Related Questions