Reputation: 992
I'm trying to parse XML data from a URL, but I can't get it to be read as UTF-8: the ¥
character gets garbled when I read it from the response:
URL url = new URL("https://suggestqueries.google.com/complete/search?output=toolbar&hl=en&q=¥");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
final InputStream in = url.openStream();
final InputSource source = new InputSource(new InputStreamReader(in, "UTF-8"));
source.setEncoding("UTF-8");
Document doc = db.parse(source);
doc.getDocumentElement().normalize();
NodeList nodeList = doc.getElementsByTagName("suggestion");
for (int i = 0; i < 10; i++) {
    Node node = nodeList.item(i);
    if (node == null || listItems.size() > 10) {
        break;
    }
    String suggestion = node.getAttributes().getNamedItem("data").getTextContent();
    // ...suggestions include � instead of ¥
}
Calling source.setEncoding() was the accepted answer in another thread, but it didn't work for me.
Upvotes: 0
Views: 1265
Reputation: 407
It seems the response is encoded in something other than UTF-8.
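To illustrate why a charset mismatch produces �, here is a minimal offline sketch. It assumes the server sends ¥ as the single ISO-8859-1 byte 0xA5, which I have not verified against the actual response; the class and method names are made up for the demo.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Decodes the bytes as UTF-8; malformed sequences become U+FFFD.
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes the bytes as ISO-8859-1, where every byte maps to one character.
    public static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Assumption: the yen sign arrives as the single ISO-8859-1 byte 0xA5.
        byte[] yen = { (byte) 0xA5 };

        // 0xA5 on its own is not a valid UTF-8 sequence, so the decoder
        // substitutes the replacement character U+FFFD.
        System.out.println(decodeUtf8(yen).equals("\uFFFD"));

        // The same byte decoded as ISO-8859-1 is U+00A5, the yen sign.
        System.out.println(decodeLatin1(yen).equals("\u00A5"));
    }
}
```

If that assumption holds, wrapping the stream in an InputStreamReader with "UTF-8" damages the character before the XML parser ever sees it.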
This works for me:
Read the document with ISO-8859-1 encoding:
Document doc = db.parse(new InputSource(new InputStreamReader(url.openStream(), "ISO-8859-1")));
The complete method looks like this:
URL url = new URL("https://suggestqueries.google.com/complete/search?output=toolbar&hl=en&q=¥");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new InputSource(new InputStreamReader(url.openStream(), "ISO-8859-1")));
doc.getDocumentElement().normalize();
NodeList nodeList = doc.getElementsByTagName("suggestion");
for (int i = 0; i < 10; i++) {
    Node node = nodeList.item(i);
    if (node == null) {
        break;
    }
    String suggestion = node.getAttributes().getNamedItem("data").getTextContent();
    System.out.println(suggestion);
}
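Hardcoding "ISO-8859-1" works as long as the server keeps using it. A possible alternative, sketched below, is to take the charset from the response's Content-Type header (available via URLConnection.getContentType()) and only fall back to a default when none is given. The helper names charsetFrom and firstSuggestion, the header value, and the sample XML are all hypothetical and stand in for the real response so the example runs offline.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class HeaderCharsetDemo {

    // Hypothetical helper: pulls the charset out of a Content-Type value
    // such as "text/xml; charset=ISO-8859-1", or returns the fallback.
    public static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    // Hypothetical helper: parses the body with the given charset and
    // returns the "data" attribute of the first <suggestion> element.
    public static String firstSuggestion(byte[] body, Charset charset) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new InputStreamReader(
                            new ByteArrayInputStream(body), charset)));
            Element first = (Element) doc.getElementsByTagName("suggestion").item(0);
            return first.getAttribute("data");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse response", e);
        }
    }

    public static void main(String[] args) {
        // Assumed header value; with a real connection this would come from
        // url.openConnection().getContentType().
        String contentType = "text/xml; charset=ISO-8859-1";

        // Made-up sample body, encoded the way the header claims.
        String xml = "<toplevel><CompleteSuggestion>"
                + "<suggestion data=\"\u00A5 to usd\"/>"
                + "</CompleteSuggestion></toplevel>";
        byte[] body = xml.getBytes(StandardCharsets.ISO_8859_1);

        Charset charset = charsetFrom(contentType, StandardCharsets.UTF_8);
        System.out.println(firstSuggestion(body, charset));
    }
}
```

Note that once you hand the parser a Reader, the decoding has already happened, so source.setEncoding() has no effect; per the InputSource documentation it only applies when a byte stream is supplied.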
Upvotes: 2