Saurav Sahu

Reputation: 13934

Replace invalid XML Unicode sequences in a string in C++

I am looking for a C++ counterpart to Java's Character.isIdentifierIgnorable(). I need to replace each such character with an escaped string derived from it, so that no information is lost.

My implementation in Java:

public static String replaceInvalidChar (String s) {
     // StringBuilder suffices here; no synchronization is needed
     StringBuilder sb = new StringBuilder();

     for (char c : s.toCharArray()) {
             if (Character.isIdentifierIgnorable(c)){
                     // Escape as \uXXXX so the original code unit can be recovered
                     sb.append(String.format("\\u%04x", (int)c));
             } else {
                     sb.append(c);
             }
     }

     return sb.toString();
}

I want to do the same in C++, but to replace these characters I first need to detect them. Can someone help me with this?

Upvotes: 0

Views: 832

Answers (2)

Remy Lebeau

Reputation: 596497

Per Java's Character.isIdentifierIgnorable(char) documentation:

Determines if the specified character should be regarded as an ignorable character in a Java identifier or a Unicode identifier.

The following Unicode characters are ignorable in a Java identifier or a Unicode identifier:

  • ISO control characters that are not whitespace

    • '\u0000' through '\u0008'
    • '\u000E' through '\u001B'
    • '\u007F' through '\u009F'
  • all characters that have the FORMAT general category value

Note: This method cannot handle supplementary characters. To support all Unicode characters, including supplementary characters, use the isIdentifierIgnorable(int) method.

Parameters:
ch - the character to be tested.

Returns:
true if the character is an ignorable control character that may be part of a Java or Unicode identifier; false otherwise.

So, try something like this:

#include <string>
#include <sstream>
#include <iomanip>

bool isFormatChar(wchar_t ch)
{
    switch (ch)
    {
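        case 0x00AD: // SOFT HYPHEN

        case 0x061C: // ARABIC LETTER MARK
        case 0x200E: // LEFT-TO-RIGHT MARK
        case 0x200F: // RIGHT-TO-LEFT MARK
        case 0x202A: // LEFT-TO-RIGHT EMBEDDING
        case 0x202B: // RIGHT-TO-LEFT EMBEDDING
        case 0x202C: // POP DIRECTIONAL FORMATTING
        case 0x202D: // LEFT-TO-RIGHT OVERRIDE
        case 0x202E: // RIGHT-TO-LEFT OVERRIDE
        case 0x2066: // LEFT-TO-RIGHT ISOLATE
        case 0x2067: // RIGHT-TO-LEFT ISOLATE
        case 0x2068: // FIRST STRONG ISOLATE
        case 0x2069: // POP DIRECTIONAL ISOLATE

            // and many many more! For the full list of Format (Cf) chars, see:
            // http://www.fileformat.info/info/unicode/category/Cf/list.htm
            //
            // U+2028 and U+2029 are separators (categories Zl and Zp), not Format
            // characters, so Java does not treat them as ignorable. Add them here
            // only if you also want to escape them:
            // http://www.fileformat.info/info/unicode/category/Zl/list.htm
            // http://www.fileformat.info/info/unicode/category/Zp/list.htm

            return true;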
    }

    return false;
}

std::wstring replaceInvalidChar(const std::wstring &s)
{
    std::wostringstream sb;

    for (auto ch: s)
    {
        if (((ch >= 0x0000) && (ch <= 0x0008)) ||
            ((ch >= 0x000E) && (ch <= 0x001B)) ||
            ((ch >= 0x007F) && (ch <= 0x009F)) ||
            isFormatChar(ch))
        {
            sb << L"\\u" << std::hex << std::nouppercase << std::setw(4) << std::setfill(L'0') << int(ch);
        } 
        else
        {
            sb.put(ch);
        }
    }

    return sb.str();
}
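
Since the Format category contains far more characters than a hand-written switch can comfortably list, one alternative is a sorted range table searched with std::lower_bound. The following is only a sketch under stated assumptions: the names Range, kFormatRanges, isFormatCharTable, isIgnorable and escapeIgnorable are hypothetical, the table is a partial list of BMP Format (Cf) ranges that should be regenerated from the Unicode data files before real use, and it works on char32_t code points rather than wchar_t.

```cpp
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <string>

// Inclusive range of code points.
struct Range { char32_t first, last; };

// PARTIAL, illustrative table of BMP code points in general category Cf,
// sorted by 'first'. Assumption: verify against the Unicode data files.
static const Range kFormatRanges[] = {
    {0x00AD, 0x00AD}, {0x0600, 0x0605}, {0x061C, 0x061C}, {0x06DD, 0x06DD},
    {0x070F, 0x070F}, {0x180E, 0x180E}, {0x200B, 0x200F}, {0x202A, 0x202E},
    {0x2060, 0x2064}, {0x2066, 0x206F}, {0xFEFF, 0xFEFF}, {0xFFF9, 0xFFFB},
};

bool isFormatCharTable(char32_t cp)
{
    // Find the first range whose upper bound is not below cp,
    // then check that cp is not below that range's lower bound.
    auto it = std::lower_bound(std::begin(kFormatRanges), std::end(kFormatRanges), cp,
                               [](const Range &r, char32_t value) { return r.last < value; });
    return it != std::end(kFormatRanges) && it->first <= cp;
}

bool isIgnorable(char32_t cp)
{
    // The three ISO control ranges from the Java documentation, plus Format chars.
    return cp <= 0x0008 ||
           (cp >= 0x000E && cp <= 0x001B) ||
           (cp >= 0x007F && cp <= 0x009F) ||
           isFormatCharTable(cp);
}

std::u32string escapeIgnorable(const std::u32string &s)
{
    std::u32string out;
    for (char32_t cp : s)
    {
        if (isIgnorable(cp))
        {
            // Format as \uXXXX (at least four lowercase hex digits).
            char buf[16];
            std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
            for (const char *p = buf; *p; ++p)
                out.push_back(static_cast<char32_t>(*p));
        }
        else
        {
            out.push_back(cp);
        }
    }
    return out;
}
```

Adding a newly assigned Format character then means adding one table entry, and the lookup stays logarithmic in the table size.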

Upvotes: 0

Galik

Reputation: 48625

From what I can gather about how Character.isIdentifierIgnorable() works, something along these lines may work for you:

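#include <cwctype>
#include <iomanip>
#include <sstream>
#include <string>

std::wstring replaceInvalidChar(std::wstring const& s)
{
    std::wostringstream sb;

    for(auto c: s)
    {
        // Escape control characters that are not whitespace (locale-dependent)
        if(std::iswcntrl(c) && !std::iswspace(c))
            sb << L"\\u" << std::hex << std::setw(4) << std::setfill(L'0') << int(c);
        else
            sb << wchar_t(c);
    }

    return sb.str();
}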
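
Note that std::iswcntrl and std::iswspace depend on the current C locale, and their idea of whitespace differs from Java's: in the default "C" locale, U+001C through U+001F count as control characters but not as whitespace, so the check above escapes them, whereas the Java documentation's ranges stop at U+001B. If matching the Java ranges matters, a locale-independent variant could look like the sketch below. The names isIsoControlNotWhitespace and replaceControlChars are hypothetical, and Format characters are deliberately left out.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: the three ISO control ranges quoted from the Java
// documentation, checked without consulting the C locale.
bool isIsoControlNotWhitespace(wchar_t c)
{
    // Go through an unsigned type so a signed wchar_t cannot compare as negative.
    const unsigned long u = static_cast<unsigned long>(c);
    return u <= 0x0008 ||
           (u >= 0x000E && u <= 0x001B) ||
           (u >= 0x007F && u <= 0x009F);
}

std::wstring replaceControlChars(const std::wstring &s)
{
    std::wostringstream sb;
    sb << std::hex << std::setfill(L'0'); // sticky flags, set once

    for (wchar_t c : s)
    {
        if (isIsoControlNotWhitespace(c))
            sb << L"\\u" << std::setw(4) << static_cast<unsigned long>(c);
        else
            sb.put(c);
    }

    return sb.str();
}
```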

Upvotes: 1
