Ziagl

Reputation: 492

How to read UTF-8 XML with RapidXML and use it with C++

I am working on a C++ application and have implemented a translator class that uses data from an XML file to translate strings. I now have serious problems with special characters, currently with German umlauts (ÖÄÜ), for example.

In the Visual Studio debug view, the sample string "Dateiäüö" read from the XML file is displayed as follows:

[Screenshot: debug view]

Because of this problem I found this post on Stack Overflow, "How to read Unicode XML values with rapidxml", and changed my RapidXML class to work with wchar_t:

std::string RapidXMLParser::getValueUTF8(const std::string path)
{
    std::vector<std::string> tags = splitPath(path);
    rapidxml::file<wchar_t> xmlFile(filename.data());
    docUTF8.parse<0>(xmlFile.data());
    rapidxml::xml_node<wchar_t>* element = findElementUTF8(path);
    if (element)
    {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
        std::wstring temp = element->value();
        std::string result = converter.to_bytes(temp);
        return result;
    }
    else
        return "";
}

But this did not solve my problem.

The source XML file was checked with a UTF-8 validator and is valid. If I change the encoding to ANSI, everything works on Windows (but this is not a solution!). If I compile the same code on Linux, I get empty strings for ANSI-encoded XML with umlauts and a crash for UTF-8-encoded XML.

The program uses wxWidgets for its interface, and the menu item there renders the same characters as the debugger displays. On Linux, the empty strings lead to missing menu items or empty lines.

I hope someone has good advice on how to solve this, or a suggestion for an alternative way to do UTF-8 translation with an editable data source such as an XML file.

EDIT: My XML parser can switch between RapidXML and TinyXML. I have also tested this with TinyXML and I get the same problem.

Upvotes: 0

Views: 1199

Answers (2)

stefan.gal

Reputation: 312

I think your findElementUTF8() should return a rapidxml::xml_node<char>*

rapidxml::xml_node<char>* element = findElementUTF8(path);

because UTF-8 is usually represented by char*. The following code works with both the Windows API and codecvt:

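  // Requires <windows.h>, <cassert>, <codecvt>, <locale> and <string>.
  // "äüö" encoded as UTF-8, NUL-terminated
  const unsigned char b8[] = { 0xc3, 0xa4, 0xc3, 0xbc, 0xc3, 0xb6, 0x00 };
  const char* utf8 = reinterpret_cast<const char*>(b8);

  std::string sb8 = utf8;

  // Conversion via the Windows API (-1: input is NUL-terminated)
  wchar_t win_conv[16]{ 0 };
  MultiByteToWideChar(CP_UTF8, 0, utf8, -1, win_conv, ARRAYSIZE(win_conv));

  // Conversion via the standard library
  std::wstring utf16_conv = std::wstring_convert<
    std::codecvt_utf8_utf16<wchar_t>>{}.from_bytes(sb8);

  // Both conversions yield the same UTF-16 string
  assert(utf16_conv == win_conv);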

Upvotes: 0

Ziagl

Reputation: 492

After hours of working on this issue, the solution was quite simple. Never trust your debugger! The problem was caused by wxWidgets: it displays the same characters as seen in my debugger, but if I put a UTF-8 to UTF-16 conversion before the rendering of menu items, it displays the string correctly!
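One way to check the data independently of how a debugger renders it is to dump the raw bytes of the string. The following is only a minimal sketch, assuming the parser returns a std::string; hexDump is a hypothetical helper name, not part of RapidXML or wxWidgets.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: render every byte of s as two lowercase hex digits,
// separated by single spaces.
std::string hexDump(const std::string& s)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0)
            out += ' ';
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

If the dump of "äüö" shows the bytes c3 a4 c3 bc c3 b6, the string is valid UTF-8 and only the display is interpreting the bytes with a different encoding.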

The solution is not to use the codecvt code shown above in my XMLParser, but instead in my wxWidgets code just before output. On Linux I now have the problem that codecvt is not part of std for g++, but that is another story.
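Where codecvt is unavailable, the UTF-8 to UTF-16 step can also be written by hand. The following is a sketch under stated assumptions, not the code used in this answer: utf8ToUtf16 is a hypothetical helper name, the result is a std::u16string rather than a std::wstring (wchar_t is usually 32 bits wide on Linux), and for brevity the decoder does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-16 without <codecvt>.
// Throws std::runtime_error on truncated or malformed input.
std::u16string utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;   // decoded code point
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::runtime_error("code point out of range");

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For example, utf8ToUtf16("Datei\xc3\xa4\xc3\xbc\xc3\xb6") yields the UTF-16 string for "Dateiäüö". wxWidgets also provides wxString::FromUTF8() for turning UTF-8 bytes into a wxString, which may make a hand-written decoder unnecessary in the GUI code.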

Oh boy...I hope this is useful if anybody has a similar problem.

Upvotes: 0
