BackSlash

Reputation: 22243

Cannot get URL content as UTF-8

I'm trying to read content from a URL, but it returns strange symbols instead of "è", "à", etc.

This is the code I'm using:

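import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public static String getPageContent(String _url) {
    StringBuilder text = new StringBuilder();
    InputStream is = null;
    try {
        URL url = new URL(_url);
        is = url.openStream();

        // This line should open the stream as UTF-8
        BufferedReader dis = new BufferedReader(new InputStreamReader(is, "UTF-8"));

        String line;
        while ((line = dis.readLine()) != null) {
            text.append(line).append("\n");
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        // is is still null if the URL was malformed or the stream failed to open
        if (is != null) {
            try {
                is.close();
            } catch (IOException ioe) {
                // ignore failures on close
            }
        }
    }
    return text.toString();
}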

I saw other questions like this, and all of them were answered along these lines:

Declare your reader as
new InputStreamReader(is, "UTF-8")

But I can't get it to work.

For example, if my URL content contains

è uno dei più

I get

Ã¨ uno dei piÃ¹

What am I missing?

Upvotes: 0

Views: 822

Answers (2)

Michael-O

Reputation: 18415

Judging by your example, you are receiving a multibyte UTF-8 byte stream, but your text editor decodes it as ISO-8859-1. Tell your editor to read the bytes as UTF-8!
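To illustrate the mismatch described here, the following is a minimal sketch (the class and method names are made up for this example) that encodes a string as UTF-8 and then decodes those bytes as ISO-8859-1, which is roughly what a misconfigured editor or console does. Each accented letter occupies two bytes in UTF-8, so it shows up as two unrelated characters.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hypothetical helper: encodes s as UTF-8, then decodes the bytes as
    // ISO-8859-1, mimicking a viewer that assumes the wrong charset.
    static String misread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u00e8" is "è" and "\u00f9" is "ù"; escapes keep this source file
        // independent of the encoding it is saved in.
        String original = "\u00e8 uno dei pi\u00f9";
        String garbled = misread(original);

        // "è" is the UTF-8 byte pair C3 A8, which ISO-8859-1 maps to "Ã" and "¨".
        System.out.println(original.length() + " chars became " + garbled.length());
        System.out.println(garbled);
    }
}
```

If the string held in memory compares equal to the expected text, the reading code is fine and only the display is wrong.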

Upvotes: 1

uberwach

Reputation: 1109

I don't really know why this would not work; however, the Java 7 way is to use StandardCharsets.UTF_8, see

http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html

with the InputStreamReader(InputStream in, Charset cs) constructor, see

http://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html.
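A minimal sketch of that suggestion, assuming Java 7 or later. The class and method names are made up for this example, and a ByteArrayInputStream stands in for url.openStream() so the snippet runs without a network connection.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class Utf8ReaderSketch {
    // Hypothetical helper: reads the whole stream as UTF-8 text and ends
    // every line with '\n', like the loop in the question.
    static String readAll(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream)
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(): the UTF-8 bytes of "è uno dei più"
        byte[] bytes = "\u00e8 uno dei pi\u00f9".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(bytes));

        // Compare in memory rather than trusting what the console displays
        System.out.println(text.equals("\u00e8 uno dei pi\u00f9\n"));
    }
}
```

Passing the Charset constant avoids the checked UnsupportedEncodingException that the String-based constructor declares, but it decodes the bytes in the same way as "UTF-8".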

Upvotes: 0
