Reputation: 6272
Let's say I have a std::string, but the data is encoded in UTF-16. How could I copy that data into a std::wstring without modifying it at all?

Also, I can't just use std::wstring from the start, because I'm retrieving a text file online and checking the Content-Type header field to determine the encoding, but I'm using a std::string to receive the data.
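For context, a rough sketch of the kind of check involved (the header value and helper name here are hypothetical):

#include <string>

// Hypothetical: decide from a Content-Type header whether the body is UTF-16.
bool isUtf16(const std::string & contentType)
{
    // e.g. "text/html; charset=utf-16le" (real headers vary in case and spacing)
    return contentType.find("charset=utf-16") != std::string::npos;
}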
Upvotes: 1
Views: 2793
Reputation: 88155
So you've stuck a series of bytes representing a UTF-16 encoded string into a std::string. Presumably you're doing something like de-serializing bytes that represent UTF-16, and the API for retrieving the bytes to be de-serialized specifies std::string. I don't think that's the best design, but you'll handle converting it to a wstring the same way you'd handle converting the bytes to float or anything else: validate the byte buffer and then cast it:
char c[] = "\0a\0b\xd8\x3d\xdc\x7f"; // UTF-16BE: 'a', 'b', then U+1F47F as a surrogate pair
std::string buf(std::begin(c), std::end(c) - 1); // -1 drops the string literal's trailing NUL
assert(0 == buf.size() % 2);
// Assumes sizeof(wchar_t) == 2 and that the bytes are already in the platform's byte order.
std::wstring utf16(reinterpret_cast<wchar_t const *>(buf.data()), buf.size() / sizeof(wchar_t));
// also validate that each code unit is legal, and that there are no isolated surrogates
Things to keep in mind:
* UTF-16 doesn't actually meet the C++ language's requirements for a wchar_t encoding, but some platforms use it regardless. This causes an issue with some standard APIs that are supposed to deal in codepoints but can't, simply because a wchar_t that represents a UTF-16 code unit cannot represent all the platform's codepoints.
Here's an implementation that doesn't rely on platform-specific details and requires nothing more than that wchar_t be large enough to hold UTF-16 code units and that each char hold exactly 8 bits of a UTF-16 code unit. It doesn't actually validate the UTF-16 data, though.
#include <string>
#include <cassert>
#include <iterator>
#include <algorithm>
#include <iostream>
#include <limits>   // std::numeric_limits, used in the static_assert below
#include <climits>  // CHAR_BIT
#include <cstddef>  // std::size_t
enum class endian {
    big, little, unknown
};
std::wstring deserialize_utf16be(std::string const &s) {
    assert(0 == s.size() % 2);
    std::wstring ws;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (i % 2)
            ws.back() = ws.back() | ((unsigned char)s[i] & 0xFF);       // odd index: low byte
        else
            ws.push_back(((unsigned char)s[i] & 0xFF) << 8);            // even index: high byte
    return ws;
}
std::wstring deserialize_utf16le(std::string const &s) {
    assert(0 == s.size() % 2);
    std::wstring ws;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (i % 2)
            ws.back() = ws.back() | (((unsigned char)s[i] & 0xFF) << 8); // odd index: high byte
        else
            ws.push_back((unsigned char)s[i] & 0xFF);                    // even index: low byte
    return ws;
}
std::wstring deserialize_utf16(std::string s, endian e = endian::unknown) {
    static_assert(std::numeric_limits<wchar_t>::max() >= 0xFFFF,
                  "wchar_t must be large enough to hold UTF-16 code units");
    static_assert(CHAR_BIT >= 8, "char must hold 8 bits of a UTF-16 code unit");
    assert(0 == s.size() % 2);
    if (endian::big == e)
        return deserialize_utf16be(s);
    if (endian::little == e)
        return deserialize_utf16le(s);
    // Unknown byte order: look for a BOM, and strip it if found.
    if (2 <= s.size() && ((unsigned char)s[0]) == 0xFF && ((unsigned char)s[1]) == 0xFE)
        return deserialize_utf16le(s.substr(2));
    if (2 <= s.size() && ((unsigned char)s[0]) == 0xFE && ((unsigned char)s[1]) == 0xFF)
        return deserialize_utf16be(s.substr(2));
    // No BOM: default to big-endian, per RFC 2781.
    return deserialize_utf16be(s);
}
int main() {
    // Test data: UTF-16LE with a BOM, encoding 'a', 'b', then U+1F47F as a surrogate pair.
    char c[] = "\xFF\xFE\x61\0b\0\x3d\xd8\x7f\xdc";
    std::string buf(std::begin(c), std::end(c) - 1); // -1 drops the literal's trailing NUL
    std::wstring utf16 = deserialize_utf16(buf);
    std::cout << std::hex;
    std::copy(begin(utf16), end(utf16), std::ostream_iterator<int>(std::cout, " "));
    std::cout << "\n";
}
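As noted above, none of this validates the code units themselves. A minimal sketch of the isolated-surrogate check mentioned in the comments (validate_utf16 is my name for it, not part of the original answer):

#include <string>
#include <cstddef>

// Returns true if every surrogate in ws is part of a well-formed pair.
bool validate_utf16(std::wstring const &ws) {
    for (std::size_t i = 0; i < ws.size(); ++i) {
        unsigned cu = ws[i] & 0xFFFF;
        if (0xD800 <= cu && cu <= 0xDBFF) {        // high surrogate...
            if (i + 1 == ws.size())
                return false;                      // ...with nothing after it
            unsigned next = ws[++i] & 0xFFFF;
            if (next < 0xDC00 || 0xDFFF < next)
                return false;                      // ...not followed by a low surrogate
        } else if (0xDC00 <= cu && cu <= 0xDFFF) {
            return false;                          // low surrogate with no high surrogate before it
        }
    }
    return true;
}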
Upvotes: 0
Reputation: 145279
If there is a BOM (Byte Order Mark) at the start, then you check that to determine the byte order. Otherwise it's best if you know the byte order, i.e., whether the least significant or the most significant byte comes first. If you don't know the byte order and have no BOM, then you just have to try one or both and apply some statistical test and/or involve a Human Decision Maker (HDM).
Let's say that this is little-endian byte order, i.e. least significant byte first.
Then for each pair of bytes do e.g.
w.push_back( (UnsignedChar( s[2*i + 1] ) << 8u) | UnsignedChar( s[2*i] ) );
where w is a std::wstring, i is an index of wide chars (i < s.length()/2), UnsignedChar is a typedef of unsigned char, s is a std::string holding the data, and 8 is the number of bits per byte, i.e. you have to assume or statically assert that CHAR_BIT from the <limits.h> header is 8.
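Putting that together, a minimal sketch under those assumptions (the function name is mine):

#include <string>
#include <cstddef>
#include <limits.h>

typedef unsigned char UnsignedChar;

// Hypothetical wrapper around the loop described above, for little-endian input.
std::wstring utf16LeToWide(std::string const& s) {
    static_assert(CHAR_BIT == 8, "assumes 8-bit bytes");
    std::wstring w;
    for (std::size_t i = 0; i < s.length() / 2; ++i)
        w.push_back((UnsignedChar(s[2*i + 1]) << 8u) | UnsignedChar(s[2*i]));
    return w;
}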
Upvotes: 1
Reputation: 308206
#include <string>
#include <stdexcept>
#include <cstddef>

std::wstring PackUTF16(const std::string & input)
{
    if (input.size() % 2 != 0)
        throw std::invalid_argument("input length must be even");
    std::wstring result(input.size() / 2, 0);
    for (std::size_t i = 0; i < result.size(); ++i)
    {
        result[i] = (input[2*i+1] & 0xff) << 8 | (input[2*i] & 0xff); // for little endian
        //result[i] = (input[2*i] & 0xff) << 8 | (input[2*i+1] & 0xff); // for big endian
    }
    return result;
}
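A quick usage sketch (the byte values here are mine, spelling "ab" in UTF-16LE):

int main()
{
    std::string bytes("\x61\x00\x62\x00", 4); // "ab" as little-endian UTF-16 bytes
    std::wstring w = PackUTF16(bytes);        // w == L"ab"
    return 0;
}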
Upvotes: 2
Reputation: 26930
Try this one:
#include <string>
#include <vector>
#include <stdexcept>
#include <cstdlib>

static inline std::wstring charToWide(const std::string & s_in)
{
    const char * cs = s_in.c_str();
    size_t aSize;
    // First call asks for the required buffer size (Microsoft _s variants).
    if (::mbsrtowcs_s(&aSize, NULL, 0, &cs, 0, NULL) != 0)
    {
        throw std::runtime_error("Cannot convert string");
    }
    std::vector<wchar_t> aBuffer(aSize);
    size_t aSizeSec;
    if (::mbstowcs_s(&aSizeSec, &aBuffer[0], aSize, cs, aSize) != 0)
    {
        throw std::runtime_error("Cannot convert string");
    }
    return std::wstring(&aBuffer[0], aSize - 1); // aSize includes the terminating NUL
}
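Note that the mbstowcs family converts from the current C locale's narrow multibyte encoding, so the result depends on the locale in effect; a caller would typically set it first. A hypothetical usage sketch:

#include <clocale>

int main()
{
    std::setlocale(LC_ALL, "");           // use the environment's narrow encoding
    std::wstring w = charToWide("hello"); // converts via the current locale
    return 0;
}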
Upvotes: 1