juwalter

Reputation: 11552

How to correctly use codecvt_byname (C++17) to encode latin1, and then UTF-8 for use in JSON

I am (desperately) trying to prepare a byte array for use in nlohmann::json while preserving the source encoding (latin1). The bytes are copied from a PLC, which constructs the "string" as a raw byte array; the locale/encoding is German, French, etc.

With the toy example shown after the compiler output below, the compiler complains about ~codecvt() and ~codecvt_byname() being protected:

/usr/bin/g++   -O3 -DNDEBUG -std=c++17 -MD -MT CMakeFiles/encod.dir/src/encod.cpp.o -MF CMakeFiles/encod.dir/src/encod.cpp.o.d -o CMakeFiles/encod.dir/src/encod.cpp.o -c /src/encod.cpp
In file included from /usr/include/c++/12/locale:43,
                 from /src/encod.cpp:1:
/usr/include/c++/12/bits/locale_conv.h: In instantiation of ‘std::__detail::_Scoped_ptr<_Tp>::~_Scoped_ptr() [with _Tp = std::codecvt<wchar_t, char, __mbstate_t>]’:
/usr/include/c++/12/bits/locale_conv.h:309:7:   required from here
/usr/include/c++/12/bits/locale_conv.h:241:26: error: ‘virtual std::codecvt<wchar_t, char, __mbstate_t>::~codecvt()’ is protected within this context
  241 |         ~_Scoped_ptr() { delete _M_ptr; }
      |                          ^~~~~~~~~~~~~
In file included from /usr/include/c++/12/bits/locale_facets_nonio.h:2067,
                 from /usr/include/c++/12/locale:41:
/usr/include/c++/12/bits/codecvt.h:429:7: note: declared protected here
  429 |       ~codecvt();
      |       ^
In file included from /usr/include/c++/12/memory:76,
                 from /src/encod.cpp:6:
/usr/include/c++/12/bits/unique_ptr.h: In instantiation of ‘void std::default_delete<_Tp>::operator()(_Tp*) const [with _Tp = std::codecvt_byname<wchar_t, char, __mbstate_t>]’:
/usr/include/c++/12/bits/unique_ptr.h:396:17:   required from ‘std::unique_ptr<_Tp, _Dp>::~unique_ptr() [with _Tp = std::codecvt_byname<wchar_t, char, __mbstate_t>; _Dp = std::default_delete<std::codecvt_byname<wchar_t, char, __mbstate_t> >]’
/src/encod.cpp:18:152:   required from here
/usr/include/c++/12/bits/unique_ptr.h:95:9: error: ‘std::codecvt_byname<_InternT, _ExternT, _StateT>::~codecvt_byname() [with _InternT = wchar_t; _ExternT = char; _StateT = __mbstate_t]’ is protected within this context
   95 |         delete __ptr;
      |         ^~~~~~~~~~~~
/usr/include/c++/12/bits/codecvt.h:722:7: note: declared protected here
  722 |       ~codecvt_byname() { }
      |       ^

#include <locale>
#include <codecvt>
#include <vector>
#include <string>
#include <iostream>
#include <memory>

int main() {
    std::vector<uint8_t> v = {0x68, 0xe4, 0x6c, 0x6c, 0x6f}; // hällo

    std::string my_string(v.begin(), v.end());

    // Convert to wide string
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;
    std::wstring wide_str = utf8_conv.from_bytes(my_string);

    // Convert wide string to Latin1 string
    std::unique_ptr<std::codecvt_byname<wchar_t, char, std::mbstate_t>> 
            latin1_cvt(new std::codecvt_byname<wchar_t, char, std::mbstate_t>("iso-8859-1"));
    std::wstring_convert<std::codecvt<wchar_t, char, std::mbstate_t>> latin1_conv(latin1_cvt.get());
    std::string latin1_str = latin1_conv.to_bytes(wide_str);


    std::cout << latin1_str << std::endl;

    return 0;
}

How can I make this work? Would I be better off using ICU for this scenario, or am I simply using codecvt_byname wrong?

Upvotes: 2

Views: 903

Answers (1)

Remy Lebeau

Reputation: 595339

Note that std::wstring_convert and the std::codecvt_utf8/std::codecvt_utf16 facets from <codecvt> are deprecated as of C++17, so you should avoid them in new code. However, they still work in existing implementations.

That said, the compiler error comes from using std::codecvt_byname incorrectly.

Unlike the std::codecvt_utf... classes, which are meant to be usable by themselves and thus have public destructors, std::codecvt_byname is a locale-managed facet and so it has a protected destructor, which means you cannot destroy a std::codecvt_byname object directly. Locale-managed facets are owned by std::locale, and it will destroy any facet that is assigned to it. This is mentioned in the ~codecvt documentation on cppreference.com:

https://en.cppreference.com/w/cpp/locale/codecvt/%7Ecodecvt

Destructs a std::codecvt facet. This destructor is protected and virtual (due to base class destructor being virtual). An object of type std::codecvt, like most facets, can only be destroyed when the last std::locale object that implements this facet goes out of scope or if a user-defined class is derived from std::codecvt and implements a public destructor.

This means you can't use std::codecvt_byname directly as the type held by a std::unique_ptr. But, as the quote above says, you can derive a new class from std::codecvt_byname and give it a public destructor. This is even demonstrated in the std::wstring_convert documentation on cppreference.com:

https://en.cppreference.com/w/cpp/locale/wstring_convert/wstring_convert

#include <locale>
#include <utility>
#include <codecvt>
 
// utility wrapper to adapt locale-bound facets for wstring/wbuffer convert
template<class Facet>
struct deletable_facet : Facet
{
    using Facet::Facet; // inherit constructors
    ~deletable_facet() {}
};
 
int main()
{
    // UTF-16le / UCS4 conversion
    std::wstring_convert<
         std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>
    > u16to32;
 
    // UTF-8 / wide string conversion with custom messages
    std::wstring_convert<std::codecvt_utf8<wchar_t>> u8towide("Error!", L"Error!");

    // GB18030 / wide string conversion facet
    typedef deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>> F;
    std::wstring_convert<F> gbtowide(new F("zh_CN.gb18030"));
}

https://en.cppreference.com/w/cpp/locale/wstring_convert/%7Ewstring_convert

#include <locale>
#include <utility>
#include <codecvt>
 
// utility wrapper to adapt locale-bound facets for wstring/wbuffer convert
template<class Facet>
struct deletable_facet : Facet
{
    template<class ...Args>
    deletable_facet(Args&& ...args) : Facet(std::forward<Args>(args)...) {}
    ~deletable_facet() {}
};
 
int main()
{
    // GB18030 / UCS4 conversion, using locale-based facet directly
    // typedef std::codecvt_byname<char32_t, char, std::mbstate_t> gbfacet_t;
    // Compiler error: "calling a protected destructor of codecvt_byname<> in ~wstring_convert"
    // std::wstring_convert<gbfacet_t> gbto32(new gbfacet_t("zh_CN.gb18030"));

    // GB18030 / UCS4 conversion facet using a facet with public destructor
    typedef deletable_facet<std::codecvt_byname<char32_t, char, std::mbstate_t>> gbfacet_t;
    std::wstring_convert<gbfacet_t> gbto32(new gbfacet_t("zh_CN.gb18030"));
} // destructor called here

Note the use of deletable_facet<std::codecvt_byname<...>> in both examples.

Also, note that std::wstring_convert takes ownership of the conversion facet you give it and deletes it in its own destructor, so you must not also manage that facet with a std::unique_ptr.
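To make the ownership rule concrete, here is a minimal sketch (not taken from cppreference). The names tracked_facet and converter_deletes_facet are hypothetical and exist only for illustration. The facet records its own destruction so you can observe that the converter, not the caller, deletes it:

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>   // std::codecvt, std::wstring_convert

// Hypothetical facet for illustration: it records when it is destroyed.
struct tracked_facet : std::codecvt<wchar_t, char, std::mbstate_t> {
    explicit tracked_facet(bool& destroyed) : destroyed_(destroyed) {}
    ~tracked_facet() { destroyed_ = true; }  // public, so wstring_convert can delete it
    bool& destroyed_;
};

// Returns true if std::wstring_convert deleted the facet it was given.
bool converter_deletes_facet() {
    bool destroyed = false;
    {
        // The converter takes ownership of the raw pointer.
        std::wstring_convert<tracked_facet> conv(new tracked_facet(destroyed));
        // No unique_ptr and no manual delete here: either would free it twice.
    }  // ~wstring_convert runs here and deletes the facet
    return destroyed;
}
```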

Thus, in your example, use this instead:

// Convert wide string to Latin1 string
using latin1_cvt = deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;
std::wstring_convert<latin1_cvt> latin1_conv(new latin1_cvt("iso-8859-1"));
std::string latin1_str = latin1_conv.to_bytes(wide_str);
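Two caveats apply to the snippet above. The string passed to std::codecvt_byname is a locale name and is platform-specific, so if "iso-8859-1" is not a locale known to your system, the constructor throws std::runtime_error. Also, the first step of the question's toy example passes Latin-1 bytes to a UTF-8 converter. Since 0xE4 followed by 0x6C is not a valid UTF-8 sequence, that from_bytes call can be expected to throw std::range_error.

If the real goal is UTF-8 for JSON, no facet is needed. ISO-8859-1 byte values are identical to the Unicode code points U+0000 to U+00FF, so each byte encodes to at most two UTF-8 bytes. The following is a minimal sketch under that assumption, not part of the standard library or nlohmann::json. The helper name latin1_to_utf8 is hypothetical, and it assumes the PLC data really is ISO-8859-1 rather than, say, Windows-1252:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: transcodes ISO-8859-1 bytes to a UTF-8 string.
// Assumes every input byte is a Latin-1 code point (U+0000..U+00FF).
std::string latin1_to_utf8(const std::vector<std::uint8_t>& bytes) {
    std::string out;
    out.reserve(bytes.size() * 2);  // worst case: every byte becomes two
    for (std::uint8_t b : bytes) {
        if (b < 0x80) {
            // ASCII range: one byte, unchanged.
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF: two-byte sequence 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

For the bytes in the question, {0x68, 0xe4, 0x6c, 0x6c, 0x6f}, this yields the UTF-8 bytes 68 C3 A4 6C 6C 6F. The resulting std::string can then be stored as a JSON string value.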

Upvotes: 2
