Reputation: 22243
I'm trying to read content from a URL, but it returns strange symbols instead of "è", "à", etc.
This is the code I'm using:
public static String getPageContent(String _url) {
    URL url;
    InputStream is = null;
    BufferedReader dis;
    String line;
    String text = "";
    try {
        url = new URL(_url);
        is = url.openStream();
        // This line should decode the stream as UTF-8
        dis = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        while ((line = dis.readLine()) != null) {
            text += line + "\n";
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        try {
            // is stays null if the URL is malformed or the stream fails to open
            if (is != null) {
                is.close();
            }
        } catch (IOException ioe) {
            // nothing to see here
        }
    }
    return text;
}
I have seen other questions like this, and all of them were answered with something like:
Declare your input stream as
new InputStreamReader(is, "UTF-8")
But I can't get it to work.
For example, if my URL content contains
è uno dei più
I get
Ã¨ uno dei piÃ¹
What am I missing?
Upvotes: 0
Views: 822
Reputation: 18415
Judging by your example, you do receive a multibyte UTF-8 byte stream, but your text editor reads it as ISO-8859-1. Tell your editor to read the bytes as UTF-8!
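To illustrate the diagnosis above, here is a minimal sketch of what happens when UTF-8 bytes are decoded as ISO-8859-1. The class and method names (MojibakeDemo, misdecode, repair) are made up for this example and are not part of the question's code. Unicode escapes are used for the accented characters so that the encoding of the source file itself cannot affect the result.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper: encodes the text as UTF-8, then decodes those bytes
    // with the wrong charset (ISO-8859-1), as a misconfigured viewer would.
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: undoes misdecode by recovering the original bytes
    // and decoding them as UTF-8.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e8 uno dei pi\u00f9"; // "è uno dei più"
        String garbled = misdecode(original);

        // Each accented letter is two bytes in UTF-8 (C3 A8 and C3 B9),
        // so each one shows up as two ISO-8859-1 characters.
        System.out.println(garbled.equals("\u00c3\u00a8 uno dei pi\u00c3\u00b9")); // prints true
        System.out.println(repair(garbled).equals(original)); // prints true
    }
}
```

If the garbled text matches this pattern, the bytes were probably read correctly and only the tool displaying them is using the wrong charset.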
Upvotes: 1
Reputation: 1109
I don't really know why this would not work. However, since Java 7 the idiomatic way is to use StandardCharsets.UTF_8, see
http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
together with the constructor InputStreamReader(InputStream in, Charset cs), which has been available since Java 1.4, see
http://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html
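As a rough sketch of that suggestion, the reading loop could look like the following. The class name Utf8Reader and the method readAll are invented for this example; it takes any InputStream (for instance the one returned by url.openStream()) so it can be tried without a network connection. Passing a Charset instead of the string "UTF-8" also avoids the checked UnsupportedEncodingException that the String-based constructor declares.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class Utf8Reader {
    // Hypothetical helper: reads the whole stream as UTF-8 text,
    // terminating every line with '\n'.
    static String readAll(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream) for us
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(): the UTF-8 bytes of "è uno dei più"
        byte[] bytes = "\u00e8 uno dei pi\u00f9".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(bytes));
        System.out.println(text.equals("\u00e8 uno dei pi\u00f9\n")); // prints true
    }
}
```

This decodes the same way as the original "UTF-8" string argument, so on its own it would not be expected to change the output if the problem lies in how the result is displayed.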
Upvotes: 0