Unable to parse trademark symbol in XML using XPath

Question

I'm trying to parse an XML file in Java, and some lines contain the HTML symbol &#153;. When I do

((String) myXPath.evaluate(node, STRING));

I get a square symbol instead of ™. My machine runs Linux and the XML encoding is UTF-8. I can't work out how to get this particular symbol decoded properly. &#8482; is decoded perfectly well...

I create the Document instance in the following way:

File xmlFile = new File(path);
FileInputStream fileIS = new FileInputStream(xmlFile);
xmlDocument = builder.parse(fileIS);

Michael Kay · Accepted Answer

The HTML entity &#153; represents the character with Unicode codepoint 153 (U+0099), which is an unprintable control character. It isn't a trademark symbol. 153 might be a trademark symbol in some Microsoft Windows character set (it is in Windows-1252), but that's irrelevant on the web. You need to use the Unicode codepoint, which is 8482 (U+2122): https://en.wikipedia.org/wiki/Trademark_symbol
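To see the difference concretely, here is a minimal sketch that parses two tiny documents with the JDK's built-in parser and prints the code point each reference produces. The class name, the `<name>` element and the "Acme" text are made up for illustration; the last line assumes the `windows-1252` charset is available in your JDK, which is normally the case.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class TrademarkDemo {

    // Parses the given XML text and evaluates an XPath expression as a string.
    public static String evaluate(String xml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (String) xpath.evaluate(expression, doc, XPathConstants.STRING);
    }

    // Decodes the single byte 0x99 (decimal 153) as Windows-1252 and returns its code point.
    public static int byte153AsWindows1252() {
        String decoded = new String(new byte[] {(byte) 0x99}, Charset.forName("windows-1252"));
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample documents; index 4 is the character right after "Acme".
        String wrong = evaluate("<name>Acme&#153;</name>", "/name");
        String right = evaluate("<name>Acme&#8482;</name>", "/name");

        System.out.printf("&#153;  -> U+%04X%n", (int) wrong.charAt(4));  // C1 control character
        System.out.printf("&#8482; -> U+%04X%n", (int) right.charAt(4));  // trademark sign
        System.out.printf("byte 153 in windows-1252 -> U+%04X%n", byte153AsWindows1252());
    }
}
```

The parser is doing exactly what the reference asks for in both cases. The square appears only because U+0099 has no glyph, so the fix belongs in the XML data rather than in the parsing code.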

Note that the numbers used in HTML entity references have nothing to do with the file encoding. In fact, that's the whole point of using them - they survive changes of encoding.
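As a rough illustration of that point, the sketch below serializes the same document in three different encodings and checks that the parsed text is identical each time. The class and method names are hypothetical, and the sample markup is pure ASCII so that it can be written out in any of the three encodings.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EncodingIndependenceDemo {

    // Builds a small document declared and stored in the given encoding,
    // parses it, and returns the text content of the root element.
    public static String parseText(String encodingName) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"" + encodingName + "\"?>"
                + "<name>Acme&#8482;</name>";
        byte[] bytes = xml.getBytes(Charset.forName(encodingName));
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String utf8 = parseText("UTF-8");
        String latin1 = parseText("ISO-8859-1");
        String ascii = parseText("US-ASCII");

        // The reference always resolves to U+2122, whatever the file encoding is.
        System.out.println(utf8.equals(latin1) && utf8.equals(ascii));
        System.out.printf("U+%04X%n", utf8.codePointAt(4));
    }
}
```

Neither ISO-8859-1 nor US-ASCII can represent ™ directly, yet the numeric reference still yields the same character, because the number names a Unicode code point rather than a byte value in the file.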
