Josh

Reputation: 6272

C++ copy data from std::string to std::wstring

Let's say I have a std::string, but the data is encoded in UTF-16.
How could I copy that data into a std::wstring, not modifying the data at all?

Also, I can't just use std::wstring directly: I'm retrieving a text file online and checking the Content-Type header field to determine the encoding, but I'm using a std::string to receive the data.

Upvotes: 1

Views: 2793

Answers (4)

bames53

Reputation: 88155

So you've stuck a series of bytes representing a UTF-16 encoded string into a std::string. Presumably you're doing something like de-serializing bytes that represent UTF-16, and the API for retrieving the bytes to be de-serialized specifies std::string. I don't think that's the best design, but you'll handle converting it to a wstring the same way you'd handle converting the bytes to float or anything else: validate the byte buffer and then cast it:

char c[] = "\0a\0b\xd8\x3d\xdc\x7f";
std::string buf(std::begin(c),std::end(c)-1); // -1 to exclude the terminating NUL
assert(0==buf.size()%2);
std::wstring utf16(reinterpret_cast<wchar_t const *>(buf.data()),buf.size()/sizeof(wchar_t));
// also validate that each code unit is legal, and that there are no isolated surrogates

Things to keep in mind:

  • This cast assumes that wchar_t is 16 bits, whereas most platforms use a 32-bit wchar_t.
  • To be useful your APIs will need to be able to treat wchar_t strings as UTF-16, either because that's the platform-specified encoding for wchar_t*, or because the APIs just follow that convention.
  • This cast assumes that the data matches the machine's endianness. Otherwise you'll have to swap the bytes of each UTF-16 code unit in the wstring. Under the UTF-16 encoding scheme, if the initial bytes aren't 0xFF 0xFE or 0xFE 0xFF, and in the absence of a higher-level protocol, UTF-16 uses a big-endian encoding.
  • std::begin(), std::end() and string::data() are C++11.

* UTF-16 doesn't actually meet the C++ language's requirements for a wchar_t encoding, but some platforms use it regardless. This causes an issue with some standard APIs that are supposed to deal in codepoints but can't, because a wchar_t that represents a UTF-16 code unit cannot represent all of the platform's codepoints.


Here's an implementation that doesn't rely on platform-specific details. It requires nothing more than that wchar_t be large enough to hold UTF-16 code units and that each char hold exactly 8 bits of a UTF-16 code unit. It doesn't actually validate the UTF-16 data, though.

#include <string>
#include <cassert>
#include <climits>   // CHAR_BIT
#include <limits>    // std::numeric_limits

#include <iterator>
#include <algorithm>
#include <iostream>

enum class endian {
    big,little,unknown
};

std::wstring deserialize_utf16be(std::string const &s) {
    assert(0==s.size()%2);

    std::wstring ws;
    for(size_t i=0;i<s.size();++i)
        if(i%2)
            ws.back() = ws.back() | ((unsigned char)s[i] & 0xFF);
        else
            ws.push_back(((unsigned char)s[i]  & 0xFF) << 8);
    return ws;
}

std::wstring deserialize_utf16le(std::string const &s) {
    assert(0==s.size()%2);

    std::wstring ws;
    for(size_t i=0;i<s.size();++i)
        if(i%2)
            ws.back() = ws.back() | (((unsigned char)s[i] & 0xFF) << 8);
        else
            ws.push_back((unsigned char)s[i] & 0xFF);
    return ws;
}

std::wstring deserialize_utf16(std::string s, endian e=endian::unknown) {
    static_assert(std::numeric_limits<wchar_t>::max() >= 0xFFFF,"wchar_t must be large enough to hold UTF-16 code units");
    static_assert(CHAR_BIT>=8,"char must hold 8 bits of UTF-16 code units");
    assert(0==s.size()%2);

    if(endian::big == e)
        return deserialize_utf16be(s);
    if(endian::little == e)
        return deserialize_utf16le(s);

    if(2<=s.size() && ((unsigned char)s[0])==0xFF && ((unsigned char)s[1])==0xFE)
        return deserialize_utf16le(s.substr(2));
    if(2<=s.size() && ((unsigned char)s[0])==0xfe && ((unsigned char)s[1])==0xff)
        return deserialize_utf16be(s.substr(2));

    return deserialize_utf16be(s);
}


int main() {
    char c[] = "\xFF\xFE\x61\0b\0\x3d\xd8\x7f\xdc";
    std::string buf(std::begin(c),std::end(c)-1);
    std::wstring utf16 = deserialize_utf16(buf);
    std::cout << std::hex;
    std::copy(begin(utf16),end(utf16),std::ostream_iterator<int>(std::cout," "));
    std::cout << "\n";
}

Upvotes: 0

Cheers and hth. - Alf

Reputation: 145279

If there is a BOM (Byte Order Mark) at the start, then you check that to determine the byte order. Otherwise it's best if you know the byte order, i.e. whether the least significant or most significant byte comes first. If you don't know the byte order and have no BOM, then you just have to try one or both, and apply some statistical test and/or involve a Human Decision Maker (HDM).

Let's say that this is little endian byte order, i.e. least significant byte first.

Then for each pair of bytes do e.g.

w.push_back( (UnsignedChar( s[2*i + 1] ) << 8u) | UnsignedChar( s[2*i] ) );

where w is a std::wstring, i is an index of wide chars < s.length()/2, UnsignedChar is a typedef for unsigned char, s is a std::string holding the data, and 8 is the number of bits per byte, i.e. you have to assume or statically assert that CHAR_BIT from the <limits.h> header is 8.

Upvotes: 1

Mark Ransom
Mark Ransom

Reputation: 308206

#include <stdexcept>
#include <string>

std::wstring PackUTF16(const std::string & input)
{
    if (input.size() % 2 != 0)
        throw std::invalid_argument("input length must be even");
    std::wstring result(input.size() / 2, 0);
    for (size_t i = 0;  i < result.size();  ++i)
    {
        result[i] = (input[2*i+1] & 0xff) << 8 | (input[2*i] & 0xff); // for little endian
        //result[i] = (input[2*i] & 0xff) << 8 | (input[2*i+1] & 0xff); // for big endian
    }
    return result;
}

Upvotes: 2

FailedDev

Reputation: 26930

Try this one:

#include <stdexcept>
#include <string>
#include <vector>
#include <wchar.h>

// Note: mbsrtowcs_s/mbstowcs_s are Annex K functions (available on MSVC);
// they convert via the current locale, not raw UTF-16 bytes.
static inline std::wstring charToWide(const std::string & s_in)
{
    const char * cs = s_in.c_str();
    size_t aSize;
    if( ::mbsrtowcs_s(&aSize, NULL, 0, &cs, 0, NULL) != 0)
    {
      throw std::runtime_error("Cannot convert string");
    }
    std::vector<wchar_t> aBuffer(aSize);
    size_t aSizeSec;
    if (::mbstowcs_s(&aSizeSec, &aBuffer[0], aSize, cs, aSize) != 0)
    {
      throw std::runtime_error("Cannot convert string");
    }
    return std::wstring(&aBuffer[0], aSize - 1);
}

Upvotes: 1
