Reputation: 35
I'm using the XMLStreamReader to read an XML file.
The file contains escaped characters in the following form: ü.
In my opinion the two escaped characters should represent the char "ü" (UTF-8 encoded?).
But the XMLStreamReader creates the following string: Ã¼
Did I do something wrong during the creation of the reader?
Reader inputReader = Files.newBufferedReader(this.xmlFile.toPath(), StandardCharsets.UTF_8);
XMLInputFactory fact = XMLInputFactory.newInstance();
fact.setProperty("javax.xml.stream.isCoalescing", true);
XMLStreamReader parser = fact.createXMLStreamReader(inputReader);
Upvotes: 1
Views: 1425
Reputation: 122364
Did I do something wrong during the creation of the reader?
No, the mistake was made by whoever created the file in the first place. A character reference represents one Unicode code point, so if you want to represent ü as a character reference it should be &#252; or &#xFC;. What appears to have happened here is that whoever created the file mixed up their encodings, treated each byte in the UTF-8 encoding of U+00FC (0xC3 0xBC) as a separate character, and serialized each of those characters as its own character reference.
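To make the mix-up concrete, here is a minimal sketch contrasting the two ways of writing references. The class and method names are hypothetical, and the hexadecimal notation is an assumption, since the exact notation used in the file is not shown above.

```java
import java.nio.charset.StandardCharsets;

public class CharRefDemo {
    // Correct: one character reference per Unicode code point.
    static String perCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp ->
            sb.append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';'));
        return sb.toString();
    }

    // Mistaken: one character reference per byte of the UTF-8 encoding.
    static String perUtf8Byte(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask with 0xFF because Java bytes are signed.
            sb.append("&#x").append(Integer.toHexString(b & 0xFF).toUpperCase()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String u = "\u00FC"; // ü
        System.out.println(perCodePoint(u)); // &#xFC;
        System.out.println(perUtf8Byte(u));  // &#xC3;&#xBC;
    }
}
```

The second form is what a conforming parser reads back as the two characters U+00C3 and U+00BC, which is exactly the Ã¼ seen in the question.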
If you can't get the file corrected at source then you'll have to fix it up post-hoc yourself. If the mis-encoding in this file has been applied consistently then the XMLStreamReader will give you a Java string containing char values that are all <= 255. Since Unicode characters 0-255 are the same as ISO-8859-1, encoding this string as ISO-8859-1 will give you a byte[] consisting of the same byte values, which you can then decode as UTF-8 to obtain the proper string:
String correctString = new String(mangledString.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
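As a rough end-to-end sketch, the following parses a small in-memory document and repairs the text. The class name, the helper methods, the sample element, and the guard against chars above 255 are all illustrative assumptions rather than part of the original answer.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class MojibakeRepair {
    // Re-encode as ISO-8859-1 to recover the raw bytes, then decode them as UTF-8.
    static String repair(String mangled) {
        for (int i = 0; i < mangled.length(); i++) {
            // Assumption: a char above 255 means the text was not mangled this way.
            if (mangled.charAt(i) > 0xFF) return mangled;
        }
        return new String(mangled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // Collects all character data from the document, as the reader in the question would.
    static String readText(String xml) {
        try {
            XMLInputFactory fact = XMLInputFactory.newInstance();
            fact.setProperty(XMLInputFactory.IS_COALESCING, true);
            XMLStreamReader parser = fact.createXMLStreamReader(new StringReader(xml));
            StringBuilder text = new StringBuilder();
            while (parser.hasNext()) {
                if (parser.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(parser.getText());
                }
            }
            parser.close();
            return text.toString();
        } catch (XMLStreamException e) {
            // Wrapped only to keep this sketch free of checked exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String mangled = readText("<a>&#xC3;&#xBC;</a>"); // two chars: U+00C3 U+00BC
        System.out.println(repair(mangled).equals("\u00FC")); // true
    }
}
```

Note that this only works if every string in the file was mangled the same way: text that was already correct and contains characters in the range 128-255 would be damaged by the round trip.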
Upvotes: 4