Markus

Reputation: 35

Escaped Characters with XMLStreamReader

I'm using XMLStreamReader to read an XML file.

The file contains escaped characters in the following form: &#195;&#188;.

In my opinion, the two escaped characters should represent the single character "ü" (UTF-8 encoded?).

But the XMLStreamReader creates the following string: Ã¼

Did I do something wrong when creating the reader?

// Read the file as UTF-8; coalescing merges adjacent character events into one.
Reader inputReader = Files.newBufferedReader(this.xmlFile.toPath(), StandardCharsets.UTF_8);
XMLInputFactory fact = XMLInputFactory.newInstance();
fact.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
XMLStreamReader parser = fact.createXMLStreamReader(inputReader);

Upvotes: 1

Views: 1425

Answers (1)

Ian Roberts

Reputation: 122364

Did I do something wrong when creating the reader?

No, the mistake was made by whoever created the file in the first place. A character reference represents one Unicode code point, so if you want to represent ü as a character reference it should be &#252; (decimal) or &#xFC; (hexadecimal). What appears to have happened here is that whoever created the file mixed up their encodings: they treated each byte in the UTF-8 encoding of U+00FC (0xC3 0xBC) as a separate character and serialized each of those characters as its own character reference.
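To make the difference concrete, here is a minimal sketch of the two serializations. The class and method names are made up for illustration, and the faulty variant is only a guess at what the file's producer did, not its actual code:

```java
import java.nio.charset.StandardCharsets;

class CharRefDemo {
    // Correct: one decimal character reference per Unicode code point.
    static String codePointReferences(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> out.append("&#").append(cp).append(';'));
        return out.toString();
    }

    // Faulty (assumed): one reference per byte of the UTF-8 encoding.
    static String perByteReferences(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask to get the unsigned byte value 0..255.
            out.append("&#").append(b & 0xFF).append(';');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String u = "\u00FC"; // the character ü
        System.out.println(codePointReferences(u)); // &#252;
        System.out.println(perByteReferences(u));   // &#195;&#188;
    }
}
```

An XML parser that reads the second form is right to return the two characters U+00C3 and U+00BC, which display as Ã¼.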

If you can't get the file corrected at source then you'll have to fix it up post-hoc yourself. If the mis-encoding in this file has been applied consistently then the XMLStreamReader will give you a Java string containing char values that are all <= 255. Since Unicode characters 0-255 are the same as ISO-8859-1, encoding this string as ISO-8859-1 will give you a byte[] consisting of the same byte values, which you can then decode as UTF-8 to obtain the proper string:

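// The StandardCharsets constants avoid the checked UnsupportedEncodingException.
String correctString = new String(mangledString.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);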
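If you are not sure the mis-encoding was applied consistently, a slightly more defensive version may be safer. The following is only a sketch: the class name MojibakeRepair and the method repair are hypothetical, and the policy of returning the input unchanged when it does not look mangled is an assumption you may want to replace with logging or an exception:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Re-interprets a string whose chars are really UTF-8 byte values.
    // Returns the input unchanged if it cannot be such a string.
    static String repair(String mangled) {
        // A char above 255 cannot be a single byte, so the text is not mangled this way.
        for (int i = 0; i < mangled.length(); i++) {
            if (mangled.charAt(i) > 0xFF) {
                return mangled;
            }
        }
        // ISO-8859-1 maps chars 0..255 to the same byte values.
        byte[] bytes = mangled.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // Strict decoding: fail on invalid UTF-8 instead of inserting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // The bytes are not valid UTF-8, so keep the original text.
            return mangled;
        }
    }

    public static void main(String[] args) {
        System.out.println(repair("\u00C3\u00BC")); // the two chars Ã¼ become ü
    }
}
```

Plain ASCII text passes through unchanged, because ASCII bytes decode to the same characters in UTF-8. A correctly encoded ü (a lone U+00FC) is also left alone, since the single byte 0xFC is not valid UTF-8.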

Upvotes: 4
