user3252635

Reputation: 125

How do I convert a string in UTF-16 to UTF-8 in C++?

Consider:

STDMETHODIMP CFileSystemAPI::setRRConfig( BSTR config_str, VARIANT* ret )
{
    mReportReaderFactory.reset( new sbis::report_reader::ReportReaderFactory() );

    USES_CONVERSION;
    std::string configuration_str = W2A( config_str );

But config_str contains a UTF-16 string. How can I convert it to UTF-8 in this piece of code?

Upvotes: 7

Views: 27033

Answers (4)

Arty

Reputation: 16737

I implemented two variants of conversion between UTF-8 <-> UTF-16 <-> UTF-32. The first variant implements all conversions from scratch; the second uses the standard std::codecvt and std::wstring_convert facilities (these two classes are deprecated starting from C++17, but still exist, and are guaranteed to exist in C++11/C++14).

If you don't like my code, you may use the almost-single-header C++ library utfcpp, which should be very well tested, given its many users.

To convert UTF-8 to UTF-16, just call Utf32To16(Utf8To32(str)); to convert UTF-16 to UTF-8, call Utf32To8(Utf16To32(str)). Or you may use my handy helper function: UtfConv<std::wstring>(std::string("abc")) for UTF-8 to UTF-16, or UtfConv<std::string>(std::wstring(L"abc")) for UTF-16 to UTF-8. UtfConv can actually convert from any UTF-encoded string to any other. See examples of these and other usages inside the Test(cs) macro, and the short sketch below.
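
For instance, a minimal usage sketch, assuming the functions from either variant below are in scope (the non-ASCII literals require the UTF-8 source/compiler options described below):

std::u16string utf16 = u"Привет";
std::string utf8 = Utf32To8(Utf16To32(utf16));                    // UTF-16 -> UTF-8
std::u16string back = Utf32To16(Utf8To32(utf8));                  // UTF-8 -> UTF-16
std::string same = UtfConv<std::string>(std::wstring(L"Привет")); // any-to-any helper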

Both variants are C++11 compliant. They compile with Clang/GCC/MSVC (see the "Try it online!" links down below) and have been tested on Windows and Linux.

You have to save both of my code snippets in files with UTF-8 encoding and pass the options -finput-charset=UTF-8 -fexec-charset=UTF-8 to Clang/GCC, or /utf-8 to MSVC. Saving as UTF-8 and these options are needed only because I put string literals with non-ASCII characters into the code, purely for testing purposes; the conversion functions themselves don't require any of this.

The inclusion of <windows.h>, <clocale>, and <iostream>, as well as the calls to SetConsoleOutputCP(65001) and std::setlocale(LC_ALL, "en_US.UTF-8"), are needed only for testing, to set the console up for correct UTF-8 output. They are not needed by the conversion functions.

Part of the code is not strictly necessary: the UtfHelper-related structure and functions are just convenience wrappers, mainly created to handle std::wstring in a cross-platform way, because wchar_t is usually 32-bit on Linux and 16-bit on Windows. The low-level functions Utf8To32, Utf32To8, Utf16To32, and Utf32To16 are the only things really needed for conversion.

Variant 1 was written based on the Wikipedia descriptions of the UTF-8 and UTF-16 encodings.
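
As a concrete illustration (a worked example, not part of the original code), here is how the surrogate-pair arithmetic in Utf32To16/Utf16To32 plays out for U+10437:

char32_t cp = 0x10437 - 0x10000;               // 0x00437, the 20-bit offset
char16_t hi = char16_t(0xD800 | (cp >> 10));   // 0xD801, high surrogate
char16_t lo = char16_t(0xDC00 | (cp & 0x3FF)); // 0xDC37, low surrogate
// Decoding reverses it: (((hi & 0x3FF) << 10) | (lo & 0x3FF)) + 0x10000 == 0x10437.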

If you find bugs or possible improvements (especially in Variant 1), please tell me and I'll fix them.


Variant 1

Try it online!

#include <string>
#include <iostream>
#include <stdexcept>
#include <type_traits>
#include <cstdint>

#ifdef _WIN32
    #include <windows.h>
#else
    #include <clocale>
#endif

#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg: " + std::string(msg)); }
#define ASSERT(cond) ASSERT_MSG(cond, "")

template <typename U8StrT = std::string>
inline static U8StrT Utf32To8(std::u32string const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    typedef typename U8StrT::value_type VT;
    typedef uint8_t u8;
    U8StrT r;
    for (auto c: s) {
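        // How many UTF-8 bytes this code point needs (1..7): standard Unicode needs at most 4; larger values cover the obsolete extended range.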
        size_t nby = c <= 0x7FU ? 1 : c <= 0x7FFU ? 2 : c <= 0xFFFFU ? 3 : c <= 0x1FFFFFU ? 4 : c <= 0x3FFFFFFU ? 5 : c <= 0x7FFFFFFFU ? 6 : 7;
        r.push_back(VT(
            nby <= 1 ? u8(c) : (
                (u8(0xFFU) << (8 - nby)) |
                u8(c >> (6 * (nby - 1)))
            )
        ));
        for (size_t i = 1; i < nby; ++i)
            r.push_back(VT(u8(0x80U | (u8(0x3FU) & u8(c >> (6 * (nby - 1 - i)))))));
    }
    return r;
}

template <typename U8StrT>
inline static std::u32string Utf8To32(U8StrT const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    typedef uint8_t u8;
    std::u32string r;
    auto it = (u8 const *)s.c_str(), end = (u8 const *)(s.c_str() + s.length());
    while (it < end) {
        char32_t c = 0;
        if (*it <= 0x7FU) {
            c = *it;
            ++it;
        } else {
            ASSERT((*it & 0xC0U) == 0xC0U);
            size_t nby = 0;
            for (u8 b = *it; (b & 0x80U) != 0; b <<= 1, ++nby) {(void)0;}
            ASSERT(nby <= 7);
            ASSERT(size_t(end - it) >= nby);
            c = *it & (u8(0xFFU) >> (nby + 1));
            for (size_t i = 1; i < nby; ++i) {
                ASSERT((it[i] & 0xC0U) == 0x80U);
                c = (c << 6) | (it[i] & 0x3FU);
            }
            it += nby;
        }
        r.push_back(c);
    }
    return r;
}


template <typename U16StrT = std::u16string>
inline static U16StrT Utf32To16(std::u32string const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    typedef typename U16StrT::value_type VT;
    typedef uint16_t u16;
    U16StrT r;
    for (auto c: s) {
        if (c <= 0xFFFFU)
            r.push_back(VT(c));
        else {
            ASSERT(c <= 0x10FFFFU);
            c -= 0x10000U;
            r.push_back(VT(u16(0xD800U | ((c >> 10) & 0x3FFU))));
            r.push_back(VT(u16(0xDC00U | (c & 0x3FFU))));
        }
    }
    return r;
}

template <typename U16StrT>
inline static std::u32string Utf16To32(U16StrT const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    typedef uint16_t u16;
    std::u32string r;
    auto it = (u16 const *)s.c_str(), end = (u16 const *)(s.c_str() + s.length());
    while (it < end) {
        char32_t c = 0;
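        // Units outside 0xD800..0xDFFF are code points themselves; a high surrogate (0xD800..0xDBFF) must be followed by a low surrogate (0xDC00..0xDFFF).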
        if (*it < 0xD800U || *it > 0xDFFFU) {
            c = *it;
            ++it;
        } else if (*it >= 0xDC00U) {
            ASSERT_MSG(false, "Unallowed UTF-16 sequence!");
        } else {
            ASSERT(end - it >= 2);
            c = (*it & 0x3FFU) << 10;
            if ((it[1] < 0xDC00U) || (it[1] > 0xDFFFU)) {
                ASSERT_MSG(false, "Unallowed UTF-16 sequence!");
            } else {
                c |= it[1] & 0x3FFU;
                c += 0x10000U;
            }
            it += 2;
        }
        r.push_back(c);
    }
    return r;
}


template <typename StrT, size_t NumBytes = sizeof(typename StrT::value_type)> struct UtfHelper;
template <typename StrT> struct UtfHelper<StrT, 1> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf8To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To8<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 2> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf16To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To16<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 4> {
    inline static std::u32string UtfTo32(StrT const & s) {
        return std::u32string((char32_t const *)(s.c_str()), (char32_t const *)(s.c_str() + s.length()));
    }
    inline static StrT UtfFrom32(std::u32string const & s) {
        return StrT((typename StrT::value_type const *)(s.c_str()),
            (typename StrT::value_type const *)(s.c_str() + s.length()));
    }
};
template <typename StrT> inline static std::u32string UtfTo32(StrT const & s) {
    return UtfHelper<StrT>::UtfTo32(s);
}
template <typename StrT> inline static StrT UtfFrom32(std::u32string const & s) {
    return UtfHelper<StrT>::UtfFrom32(s);
}
template <typename StrToT, typename StrFromT> inline static StrToT UtfConv(StrFromT const & s) {
    return UtfFrom32<StrToT>(UtfTo32(s));
}

#define Test(cs) \
    std::cout << Utf32To8(Utf8To32(std::string(cs))) << ", "; \
    std::cout << Utf32To8(Utf16To32(Utf32To16(Utf8To32(std::string(cs))))) << ", "; \
    std::cout << Utf32To8(Utf16To32(std::u16string(u##cs))) << ", "; \
    std::cout << Utf32To8(std::u32string(U##cs)) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::u16string>(UtfConv<std::u32string>(UtfConv<std::u32string>(UtfConv<std::u16string>(std::string(cs)))))) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::wstring>(UtfConv<std::string>(UtfConv<std::u32string>(UtfConv<std::u32string>(std::string(cs)))))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::string(cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u16string(u##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::wstring(L##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u32string(U##cs))) << std::endl; \
    std::cout << "UTF-8 num bytes: " << std::dec << Utf32To8(std::u32string(U##cs)).size() << ", "; \
    std::cout << "UTF-16 num bytes: " << std::dec << (Utf32To16(std::u32string(U##cs)).size() * 2) << std::endl;

int main() {
    #ifdef _WIN32
        SetConsoleOutputCP(65001);
    #else
        std::setlocale(LC_ALL, "en_US.UTF-8");
    #endif
    try {
        Test("World");
        Test("Привет");
        Test("𐐷𤭢");
        Test("𝞹");
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Output:

World, World, World, World, World, World, World, World, World, World
UTF-8 num bytes: 5, UTF-16 num bytes: 10
Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет
UTF-8 num bytes: 12, UTF-16 num bytes: 12
𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢
UTF-8 num bytes: 8, UTF-16 num bytes: 8
𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹
UTF-8 num bytes: 4, UTF-16 num bytes: 4

Variant 2

Try it online!

#include <string>
#include <iostream>
#include <stdexcept>
#include <type_traits>
#include <locale>
#include <codecvt>
#include <cstdint>

#ifdef _WIN32
    #include <windows.h>
#else
    #include <clocale>
#endif

#define ASSERT(cond) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "!"); }

// Workaround for some of MSVC compilers.
#if defined(_MSC_VER) && (!_DLL) && (_MSC_VER >= 1900 /* VS 2015*/) && (_MSC_VER <= 1914 /* VS 2017 */)
std::locale::id std::codecvt<char16_t, char, _Mbstatet>::id;
std::locale::id std::codecvt<char32_t, char, _Mbstatet>::id;
#endif

template <typename U8StrT>
inline static std::u32string Utf8To32(U8StrT const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv_;
    return utf_8_32_conv_.from_bytes((char const *)s.c_str(), (char const *)(s.c_str() + s.length()));
}

template <typename U8StrT = std::string>
inline static U8StrT Utf32To8(std::u32string const & s) {
    static_assert(sizeof(typename U8StrT::value_type) == 1, "Char byte-size should be 1 for UTF-8 strings!");
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf_8_32_conv_;
    std::string res = utf_8_32_conv_.to_bytes(s.c_str(), s.c_str() + s.length());
    return U8StrT(
        (typename U8StrT::value_type const *)(res.c_str()),
        (typename U8StrT::value_type const *)(res.c_str() + res.length()));
}

template <typename U16StrT>
inline static std::u32string Utf16To32(U16StrT const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>, char32_t> utf_16_32_conv_;
    return utf_16_32_conv_.from_bytes((char const *)s.c_str(), (char const *)(s.c_str() + s.length()));
}

template <typename U16StrT = std::u16string>
inline static U16StrT Utf32To16(std::u32string const & s) {
    static_assert(sizeof(typename U16StrT::value_type) == 2, "Char byte-size should be 2 for UTF-16 strings!");
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>, char32_t> utf_16_32_conv_;
    std::string res = utf_16_32_conv_.to_bytes(s.c_str(), s.c_str() + s.length());
    return U16StrT(
        (typename U16StrT::value_type const *)(res.c_str()),
        (typename U16StrT::value_type const *)(res.c_str() + res.length()));
}


template <typename StrT, size_t NumBytes = sizeof(typename StrT::value_type)> struct UtfHelper;
template <typename StrT> struct UtfHelper<StrT, 1> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf8To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To8<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 2> {
    inline static std::u32string UtfTo32(StrT const & s) { return Utf16To32(s); }
    inline static StrT UtfFrom32(std::u32string const & s) { return Utf32To16<StrT>(s); }
};
template <typename StrT> struct UtfHelper<StrT, 4> {
    inline static std::u32string UtfTo32(StrT const & s) {
        return std::u32string((char32_t const *)(s.c_str()), (char32_t const *)(s.c_str() + s.length()));
    }
    inline static StrT UtfFrom32(std::u32string const & s) {
        return StrT((typename StrT::value_type const *)(s.c_str()),
            (typename StrT::value_type const *)(s.c_str() + s.length()));
    }
};
template <typename StrT> inline static std::u32string UtfTo32(StrT const & s) {
    return UtfHelper<StrT>::UtfTo32(s);
}
template <typename StrT> inline static StrT UtfFrom32(std::u32string const & s) {
    return UtfHelper<StrT>::UtfFrom32(s);
}
template <typename StrToT, typename StrFromT> inline static StrToT UtfConv(StrFromT const & s) {
    return UtfFrom32<StrToT>(UtfTo32(s));
}

#define Test(cs) \
    std::cout << Utf32To8(Utf8To32(std::string(cs))) << ", "; \
    std::cout << Utf32To8(Utf16To32(Utf32To16(Utf8To32(std::string(cs))))) << ", "; \
    std::cout << Utf32To8(Utf16To32(std::u16string(u##cs))) << ", "; \
    std::cout << Utf32To8(std::u32string(U##cs)) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::u16string>(UtfConv<std::u32string>(UtfConv<std::u32string>(UtfConv<std::u16string>(std::string(cs)))))) << ", "; \
    std::cout << UtfConv<std::string>(UtfConv<std::wstring>(UtfConv<std::string>(UtfConv<std::u32string>(UtfConv<std::u32string>(std::string(cs)))))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::string(cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u16string(u##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::wstring(L##cs))) << ", "; \
    std::cout << UtfFrom32<std::string>(UtfTo32(std::u32string(U##cs))) << std::endl; \
    std::cout << "UTF-8 num bytes: " << std::dec << Utf32To8(std::u32string(U##cs)).size() << ", "; \
    std::cout << "UTF-16 num bytes: " << std::dec << (Utf32To16(std::u32string(U##cs)).size() * 2) << std::endl;

int main() {
    #ifdef _WIN32
        SetConsoleOutputCP(65001);
    #else
        std::setlocale(LC_ALL, "en_US.UTF-8");
    #endif
    try {
        Test("World");
        Test("Привет");
        Test("𐐷𤭢");
        Test("𝞹");
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Output:

World, World, World, World, World, World, World, World, World, World
UTF-8 num bytes: 5, UTF-16 num bytes: 10
Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет, Привет
UTF-8 num bytes: 12, UTF-16 num bytes: 12
𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢, 𐐷𤭢
UTF-8 num bytes: 8, UTF-16 num bytes: 8
𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹, 𝞹
UTF-8 num bytes: 4, UTF-16 num bytes: 4

Upvotes: 7

AndersK

Reputation: 36082

You can do something like this:

#include <string>
#include <vector>
#include <sstream>
#include <stdexcept>
#include <windows.h>

std::string WstrToUtf8Str(const std::wstring& wstr)
{
  std::string retStr;
  if (!wstr.empty())
  {
    // First call: pass a NULL buffer to get the required size in bytes.
    // Passing -1 as the input length includes the terminating null.
    int sizeRequired = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, NULL, 0, NULL, NULL);

    if (sizeRequired > 0)
    {
      std::vector<char> utf8String(sizeRequired);
      int bytesConverted = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(),
                           -1, &utf8String[0], static_cast<int>(utf8String.size()), NULL,
                           NULL);
      if (bytesConverted != 0)
      {
        retStr = &utf8String[0];
      }
      else
      {
        // Note: a wide string cannot be streamed into a narrow
        // std::stringstream, so only the function name is reported.
        std::stringstream err;
        err << __FUNCTION__ << " failed to convert wstring to UTF-8";
        throw std::runtime_error( err.str() );
      }
    }
  }
  return retStr;
}

You can pass your BSTR to the function as a std::wstring, as sketched below.
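
For example, a minimal sketch of calling it from the questioner's method (setRRConfig and config_str come from the question; SysStringLen is the Win32 API that returns a BSTR's character count):

STDMETHODIMP CFileSystemAPI::setRRConfig( BSTR config_str, VARIANT* ret )
{
    // A BSTR may be NULL, which by convention means an empty string.
    std::wstring wide = config_str
        ? std::wstring( config_str, SysStringLen( config_str ) )
        : std::wstring();
    std::string configuration_str = WstrToUtf8Str( wide );
    // ... use configuration_str ...
    return S_OK;
}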

Upvotes: 6

beardedN5rd

Reputation: 185

If you are using C++11, you may check this out:

http://www.cplusplus.com/reference/codecvt/codecvt_utf8_utf16/
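
For example, a minimal sketch using that facet (std::codecvt_utf8_utf16 and std::wstring_convert are deprecated since C++17 but available in C++11/14):

#include <codecvt>
#include <locale>
#include <string>

// Converts a UTF-16 string to UTF-8 via the standard codecvt facet.
std::string Utf16ToUtf8(const std::u16string& utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}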

Upvotes: 2

kvv

Reputation: 356

void encode_unicode_character(char* buffer, int* offset, wchar_t ucs_character)
{
    if (ucs_character <= 0x7F)
    {
        // Plain single-byte ASCII.
        buffer[(*offset)++] = (char) ucs_character;
    }
    else if (ucs_character <= 0x7FF)
    {
        // Two bytes.
        buffer[(*offset)++] = 0xC0 | (ucs_character >> 6);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 0) & 0x3F);
    }
    else if (ucs_character <= 0xFFFF)
    {
        // Three bytes.
        buffer[(*offset)++] = 0xE0 | (ucs_character >> 12);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 6) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 0) & 0x3F);
    }
    else if (ucs_character <= 0x1FFFFF)
    {
        // Four bytes.
        buffer[(*offset)++] = 0xF0 | (ucs_character >> 18);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 12) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 6) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 0) & 0x3F);
    }
    else if (ucs_character <= 0x3FFFFFF)
    {
        // Five bytes.
        buffer[(*offset)++] = 0xF8 | (ucs_character >> 24);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 18) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 12) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 6) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 0) & 0x3F);
    }
    else if (ucs_character <= 0x7FFFFFFF)
    {
        // Six bytes.
        buffer[(*offset)++] = 0xFC | (ucs_character >> 30);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 24) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 18) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 12) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 6) & 0x3F);
        buffer[(*offset)++] = 0x80 | ((ucs_character >> 0) & 0x3F);
    }
    else
    {
        // Invalid char; don't encode anything.
    }
}

ISO/IEC 10646:2012 is all you need to understand UCS.
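
For example, a hypothetical driver for the function above. Note that on Windows wchar_t is only 16 bits, so code points above 0xFFFF arrive as surrogate pairs and would need to be combined first; on Linux wchar_t holds full code points:

#include <cstdio>

int main()
{
    const wchar_t* input = L"Привет";
    char buffer[64];
    int offset = 0;
    for (const wchar_t* p = input; *p != L'\0'; ++p)
        encode_unicode_character(buffer, &offset, *p);
    buffer[offset] = '\0';
    std::printf("%s\n", buffer); // prints the UTF-8 bytes
    return 0;
}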

Upvotes: 2
