Reputation: 57
Let's say I have read the binary content of a text file into a std::vector<std::uint8_t> and I want to transform these bytes into a string representation. As long as the bytes are encoded using a single-byte encoding (ASCII, for example), a transformation to std::string is pretty straightforward:
std::string transformToString(std::vector<std::uint8_t> bytes)
{
    std::string str;
    str.assign(
        reinterpret_cast<std::string::value_type*>(const_cast<std::uint8_t*>(bytes.data())),
        bytes.size() / sizeof(std::string::value_type)
    );
    return str;
}
As soon as the bytes are encoded in some Unicode format, things get a little bit more complicated. As far as I know, C++ supports additional string types for Unicode strings. These are std::u8string for UTF-8, std::u16string for UTF-16 and std::u32string for UTF-32.
Problem 1: Let's say the bytes are encoded in UTF-8. How can I create a std::u8string from these bytes in the first place? Also, how do I know the length of the string, since there can be code points encoded in multiple bytes?
Problem 2: I've seen that UTF-16 and UTF-32 support both big-endian and little-endian byte order. Let's say the bytes are encoded in UTF-16 BE or UTF-16 LE. How can I create a std::u16string from the bytes (and how can I specify the byte order for the transformation)? I am looking for something like std::u16string u16str = std::u16string::from_bytes(bytes, byte_order::big_endian);.
Problem 3: Are the previously listed types of Unicode string already aware of a byte order mark, or does the byte order mark (if present) need to be processed separately? Since the said string types are just std::basic_string templated on char8_t, char16_t and char32_t, I assume that processing of a byte order mark is not supported.
Clarification: Please note that I do not want to do any conversions. Almost every article I found was about how to convert UTF-8 strings to other encodings and vice versa. I just want to get the string representation of the specified byte array. Therefore, as the user/programmer, I must be aware of the encoding of the bytes to get the correct representation. For example:

- The bytes are encoded in UTF-8 (e.g. 41 42 43 (ABC)). I try to transform them to a std::u8string. The transformation was correct, the string is ABC.
- The bytes are encoded in UTF-8 (e.g. 41 42 43 (ABC)). I try to transform them to a std::u16string. The transformation fails or the resulting string is not correct.

Upvotes: 1
Views: 1254
Reputation: 76628
Your transformToString is (more or less) correct only if uint8_t is unsigned char, which is, however, the case on every platform I know of.
It is unnecessary to do the multiple casts you are doing. The whole cast sequence avoids being an aliasing violation only because you are casting from unsigned char* to char* (and char is always the value type of std::string). In particular, there is no const involved, since bytes is passed by value, so the const_cast does nothing useful. I also say "more or less" because, while this is probably supposed to work specifically when casting between signed/unsigned variants of the same element type, the standard currently doesn't actually specify the pointer arithmetic on the resulting pointer (which I guess is a defect).
However, there is a much safer way that doesn't involve dangerous casts or any potential for a length mismatch:
str.assign(std::begin(bytes), std::end(bytes));
You can use exactly the same line as above to convert to any other std::basic_string specialization, but the important point is that it will simply copy individual bytes as individual code units, not considering encoding or endianness in any way.
Problem 1: Let's say the bytes are encoded in UTF-8. How can I create a std::u8string from these bytes in the first place? Also, how do I know the length of the string, since there can be code points encoded in multiple bytes?
You create the string with exactly the same line I showed above. In this case your approach would be wrong if you just replaced str's type, because char8_t cannot alias unsigned char, so the cast would be an aliasing violation resulting in undefined behavior.
A std::u8string holds a sequence of UTF-8 code units (by convention). To get individual code points you can convert to UTF-32. There is std::mbrtoc32 from the C standard library, which relies on the C locale being set to a UTF-8 locale (and requires converting back to a char array first), and there is codecvt_utf8<char32_t> from the C++ library, which is however deprecated, and no replacement has been decided on yet.
There are no functions in the standard library that actually interpret the sequence of code units in a u8string as code points (e.g. .size() is the number of code units, not the number of code points).
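As a rough sketch of the std::mbrtoc32 route, here is how one could count the code points in UTF-8 bytes. The locale name "en_US.UTF-8" is an assumption (it must exist on the target system), and error handling is minimal:

#include <clocale>  // std::setlocale
#include <cstddef>
#include <cstdint>
#include <cuchar>   // std::mbrtoc32, std::mbstate_t
#include <vector>

// Counts Unicode code points in UTF-8 encoded bytes via std::mbrtoc32.
// Assumes a UTF-8 C locale; "en_US.UTF-8" is only an example name.
std::size_t countCodePoints(const std::vector<std::uint8_t>& bytes)
{
    std::setlocale(LC_ALL, "en_US.UTF-8");
    const char* p = reinterpret_cast<const char*>(bytes.data());
    const char* end = p + bytes.size();
    std::mbstate_t state{};
    char32_t c32;
    std::size_t count = 0;
    while (p < end) {
        std::size_t rc = std::mbrtoc32(&c32, p, static_cast<std::size_t>(end - p), &state);
        if (rc == static_cast<std::size_t>(-1)      // invalid sequence
            || rc == static_cast<std::size_t>(-2))  // incomplete sequence
            break;
        if (rc == static_cast<std::size_t>(-3)) {   // emitted from state, no bytes consumed
            ++count;
            continue;
        }
        if (rc == 0)  // decoded a null character, which occupies one byte
            rc = 1;
        p += rc;
        ++count;
    }
    return count;
}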
Problem 2: I've seen that UTF-16 and UTF-32 support both big-endian and little-endian byte order. Let's say the bytes are encoded in UTF-16 BE or UTF-16 LE. How can I create a std::u16string from the bytes (and how can I specify the byte order for the transformation)? I am looking for something like std::u16string u16str = std::u16string::from_bytes(bytes, byte_order::big_endian);.
There is nothing like that directly in the standard library. A u16string holds 16-bit code units of type char16_t as values. What endianness, or in general what representation, is used for this type is an implementation detail, but you can expect it to be equal to that of the other basic types. Since C++20 there is std::endian to indicate the endianness of all scalar types (if applicable), and since C++23 there is std::byteswap, which can be used to swap the byte order if the native endianness doesn't match the source endianness. However, you would need to manually iterate over the vector and form char16_ts from pairs of bytes with bitwise operations anyway, so I am not sure whether this is all that helpful.
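To illustrate, a from_bytes-style helper along the lines of what you asked for could be sketched like this (byte_order and u16string_from_bytes are made-up names, nothing standard):

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum class byte_order { big_endian, little_endian };

// Forms one char16_t from each consecutive pair of bytes,
// honoring the byte order of the source data.
std::u16string u16string_from_bytes(const std::vector<std::uint8_t>& bytes,
                                    byte_order order)
{
    if (bytes.size() % 2 != 0)
        throw std::invalid_argument("UTF-16 data must be an even number of bytes");

    std::u16string result;
    result.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        std::uint16_t hi = bytes[i];      // first byte of the pair
        std::uint16_t lo = bytes[i + 1];  // second byte of the pair
        if (order == byte_order::little_endian)
            std::swap(hi, lo);
        result.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return result;
}

Usage would then be std::u16string u16str = u16string_from_bytes(bytes, byte_order::big_endian);. Because the code units are assembled with shifts rather than memcpy, the native endianness of the platform does not matter here.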
All of the above assumes that the original data is actually UTF-16 encoded. If that is not the case you need to convert from the original encoding to UTF-16 for which there are equivalent functions as in the UTF-32 case mentioned above.
Problem 3: Are the previously listed types of Unicode string already aware of a byte order mark, or does the byte order mark (if present) need to be processed separately? Since the said string types are just std::basic_string templated on char8_t, char16_t and char32_t, I assume that processing of a byte order mark is not supported.
The types simply store sequences of code units. They do not care what those code units represent (e.g. whether one of them is a BOM). Because they store code units, not bytes, a BOM wouldn't have any meaning in processing them anyway; there is no byte order left to mark at that level.
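So if your input may start with a BOM, you have to look at the raw bytes yourself before building the string. A small sketch, reusing the hypothetical byte_order enum from above:

#include <cstdint>
#include <optional>
#include <vector>

// Checks the first two bytes for a UTF-16 BOM (FE FF or FF FE).
// Returns the detected byte order, or nothing if no BOM is present,
// in which case the byte order must be known from elsewhere.
std::optional<byte_order> detect_utf16_bom(const std::vector<std::uint8_t>& bytes)
{
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF)
            return byte_order::big_endian;
        if (bytes[0] == 0xFF && bytes[1] == 0xFE)
            return byte_order::little_endian;
    }
    return std::nullopt;
}

After a BOM is detected you would skip those two bytes and convert the rest; once the data is in a u16string, the BOM (if you kept it) is just the leading code unit U+FEFF.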
Upvotes: 0