Johannes

Reputation: 133

HTML content downloaded using URL.openStream() always contains invalid characters

I'm trying to download HTML code from YouTube in Java, but the resulting String always contains invalid characters. For example "ü" becomes "u?".
I've tried using all the usual encodings and even wrote a little test program that tries every encoding and every combination of encodings, but the invalid characters remain.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.Charset;

public class EncodingTest {

    public static void main(final String[] args) throws MalformedURLException, IOException {
        for (final Charset a : Charset.availableCharsets().values()) {
            final BufferedReader in = new BufferedReader(new InputStreamReader(new URL("https://www.youtube.com/watch?v=WENkquBHchM").openStream(), a));

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                if (inputLine.contains("og:title")) {

                    System.out.println(inputLine);

                    for (final Charset b : Charset.availableCharsets().values()) {
                        try {
                            System.out.println(new String(inputLine.getBytes(), b) + "\t[" + a + " -> " + b + "]");
                        }
                        catch (final Exception e) {
                        }
                    }
                }
            }
            in.close();
        }
    }
}

If I open the URL in a browser or download it using wget or something similar, there are no errors. I've also tried downloading HTML from some other sites, and it works fine there.
Is there any way to fix this?

Upvotes: 0

Views: 189

Answers (2)

Johannes

Reputation: 133

Turns out the problem was the encoding of my source files. Eclipse used "Cp1252" as the default. After switching to "UTF-8" everything is working fine.
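For anyone wondering why a charset mismatch like this corrupts text, here is a minimal sketch. It is not necessarily exactly what happened in my project; it only shows that the same character has different byte representations in Cp1252 and UTF-8, so bytes written with one and read with the other come out wrong. The class name and the `reinterpret` helper are made up for illustration, and it assumes the JDK provides the `windows-1252` charset (the canonical name for Cp1252).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingMismatchSketch {

    // Hypothetical helper: encodes text with one charset and decodes the
    // resulting bytes with another, simulating a charset mismatch.
    static String reinterpret(final String text, final Charset encodeWith, final Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(final String[] args) {
        // Assumption: windows-1252 is available in this JDK.
        final Charset cp1252 = Charset.forName("windows-1252");

        // Written as a Unicode escape so the source file's own encoding does not matter.
        final String umlaut = "\u00fc";

        // The same character is two bytes in UTF-8 and one byte in Cp1252.
        System.out.println(Arrays.toString(umlaut.getBytes(StandardCharsets.UTF_8))); // [-61, -68]
        System.out.println(Arrays.toString(umlaut.getBytes(cp1252)));                 // [-4]

        // UTF-8 bytes read as Cp1252 turn one character into two different ones.
        final String garbled = reinterpret(umlaut, StandardCharsets.UTF_8, cp1252);
        System.out.println(garbled.equals("\u00c3\u00bc")); // true
    }
}
```

The Unicode escapes are deliberate: they keep the example independent of whatever encoding the IDE uses for source files.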

Upvotes: 0

user3305829

Reputation: 11

It's plain UTF-8 (as the response header shows in Chrome). Do not convert it back and forth. If it still does not work, then the problem is that your console can't display UTF-8 text.

Try this:

BufferedReader in = new BufferedReader(new InputStreamReader(new URL("https://...").openStream(), "UTF-8"));
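A slightly fuller sketch of the same idea, in case it helps: decode the stream once as UTF-8 and print through a `PrintStream` that is also set to UTF-8, so the JVM's default charset is not involved on either side. The class name and the `findLine` helper are hypothetical, and the sample bytes in `main` stand in for the real response so the example runs without network access. Whether the characters then display correctly still depends on the console itself.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Utf8ReadSketch {

    // Hypothetical helper: decodes the stream as UTF-8 and returns the first
    // line containing the marker, or null if no line matches.
    static String findLine(final InputStream raw, final String marker) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(raw, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains(marker)) {
                    return line;
                }
            }
        }
        return null;
    }

    public static void main(final String[] args) throws IOException {
        // Stand-in for the bytes returned by URL.openStream(); made-up sample data.
        final byte[] page = "<meta property=\"og:title\" content=\"\u00fcber\">\n"
                .getBytes(StandardCharsets.UTF_8);

        final String line = findLine(new ByteArrayInputStream(page), "og:title");

        // Encode the output explicitly instead of relying on the default charset.
        final PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(line);
    }
}
```

With a real page, the `ByteArrayInputStream` would be replaced by the stream from `URL.openStream()`, as in the one-liner above.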

Upvotes: 1
