Reputation: 13934
I am looking for a C++ counterpart to Java's Character.isIdentifierIgnorable().
Basically, I have to replace each such character with a string derived from it, so that no information is lost.
My implementation in Java:
public static String replaceInvalidChar(String s) {
    StringBuffer sb = new StringBuffer();
    char[] characters = s.toCharArray();
    for (char c : characters) {
        if (Character.isIdentifierIgnorable(c)) {
            sb.append(String.format("\\u%04x", (int) c));
        } else {
            sb.append(c);
        }
    }
    return sb.toString();
}
I want to do the same in C++, but before I can replace these characters I need to detect them. Can someone help me with this?
Upvotes: 0
Views: 832
Reputation: 596497
Per the documentation for Java's Character.isIdentifierIgnorable(char):
Determines if the specified character should be regarded as an ignorable character in a Java identifier or a Unicode identifier.
The following Unicode characters are ignorable in a Java identifier or a Unicode identifier:
ISO control characters that are not whitespace
- '\u0000' through '\u0008'
- '\u000E' through '\u001B'
- '\u007F' through '\u009F'
all characters that have the FORMAT general category value
Note: This method cannot handle supplementary characters. To support all Unicode characters, including supplementary characters, use the isIdentifierIgnorable(int) method.
Parameters:
ch - the character to be tested.
Returns:
true if the character is an ignorable control character that may be part of a Java or Unicode identifier; false otherwise.
So, try something like this:
#include <string>
#include <sstream>
#include <iomanip>

// Returns true for characters in the Unicode Cf (Format) category.
// Only Cf counts here: the Zl/Zp separators U+2028 and U+2029 are not
// FORMAT characters in Java, so they are not listed.
bool isFormatChar(wchar_t ch)
{
    switch (ch)
    {
        case 0x00AD:
        case 0x061C:
        case 0x200E:
        case 0x200F:
        case 0x202A:
        case 0x202B:
        case 0x202C:
        case 0x202D:
        case 0x202E:
        case 0x2066:
        case 0x2067:
        case 0x2068:
        case 0x2069:
        // and many many more! For the full list of Format chars, see:
        // http://www.fileformat.info/info/unicode/category/Cf/list.htm
            return true;
    }
    return false;
}
std::wstring replaceInvalidChar(const std::wstring &s)
{
    std::wostringstream sb;
    for (auto ch: s)
    {
        if (((ch >= 0x0000) && (ch <= 0x0008)) ||
            ((ch >= 0x000E) && (ch <= 0x001B)) ||
            ((ch >= 0x007F) && (ch <= 0x009F)) ||
            isFormatChar(ch))
        {
            sb << L"\\u" << std::hex << std::nouppercase << std::setw(4) << std::setfill(L'0') << int(ch);
        }
        else
        {
            sb.put(ch);
        }
    }
    return sb.str();
}
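A switch gets unwieldy once more format characters are listed. One alternative is a sorted table of code point ranges searched with std::lower_bound. The following is only a minimal sketch under assumptions: the names CodePointRange, kFormatRanges and isFormatCharByTable are hypothetical, and the table is a partial list of BMP characters in the Cf category. The exact set also depends on the Unicode version of the Java runtime whose behavior is being matched. U+2028 and U+2029 belong to the Zl and Zp categories rather than Cf, so the table leaves them out.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical helper type: an inclusive range of code points.
struct CodePointRange
{
    unsigned long first;
    unsigned long last;
};

// Partial table of BMP characters in the Cf (Format) category.
// It must stay sorted and non-overlapping for the binary search below.
static const CodePointRange kFormatRanges[] = {
    {0x00AD, 0x00AD}, // SOFT HYPHEN
    {0x061C, 0x061C}, // ARABIC LETTER MARK
    {0x200B, 0x200F}, // zero-width characters and directional marks
    {0x202A, 0x202E}, // bidirectional embeddings and overrides
    {0x2060, 0x2064}, // word joiner and invisible operators
    {0x2066, 0x206F}, // bidirectional isolates and deprecated format chars
    {0xFEFF, 0xFEFF}, // ZERO WIDTH NO-BREAK SPACE (byte order mark)
    {0xFFF9, 0xFFFB}, // interlinear annotation characters
};

bool isFormatCharByTable(wchar_t ch)
{
    const unsigned long cp = static_cast<unsigned long>(ch);
    const CodePointRange *begin = std::begin(kFormatRanges);
    const CodePointRange *end = std::end(kFormatRanges);

    // Find the first range whose upper bound is not below cp.
    const CodePointRange *it = std::lower_bound(
        begin, end, cp,
        [](const CodePointRange &r, unsigned long value) { return r.last < value; });

    // cp is a format character only if it also reaches that range's lower bound.
    return it != end && it->first <= cp;
}
```

A function like this could stand in for isFormatChar() in the loop above. Extending coverage is then a matter of adding rows to the table in sorted order.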
Upvotes: 0
Reputation: 48625
From what I can gather about how Character.isIdentifierIgnorable() works, something along these lines may work for you:
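#include <cwctype>
#include <iomanip>
#include <sstream>
#include <string>

std::wstring replaceInvalidChar(std::wstring const& s)
{
    std::wostringstream sb;
    for(auto c: s)
    {
        // Note: iswcntrl/iswspace depend on the current C locale.
        if(std::iswcntrl(c) && !std::iswspace(c))
            sb << L"\\u" << std::hex << std::setw(4) << std::setfill(L'0') << int(c);
        else
            sb << wchar_t(c);
    }
    return sb.str();
}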
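Because std::iswcntrl and std::iswspace consult the current C locale, their results may not line up exactly with the fixed ranges in the Java documentation. They also do not cover format characters. If locale independence matters, one option is to test the documented control ranges directly. A minimal sketch, where the name isIgnorableControl is hypothetical and the ranges are the ones listed in the Java documentation for isIdentifierIgnorable:

```cpp
// Hypothetical helper: true for the ISO control characters that the Java
// documentation lists as ignorable (control characters that are not whitespace).
bool isIgnorableControl(wchar_t ch)
{
    // Work on an unsigned value so a signed wchar_t cannot match by being negative.
    const unsigned long cp = static_cast<unsigned long>(ch);
    return cp <= 0x0008 ||
           (cp >= 0x000E && cp <= 0x001B) ||
           (cp >= 0x007F && cp <= 0x009F);
}
```

With this predicate, tab (U+0009), newline, and U+001C through U+001F are left alone, while U+0000 through U+0008 and U+007F through U+009F are flagged.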
Upvotes: 1