Reputation: 3713
I'm writing a small crawler for English-only sites, and I fetch each page by opening a URL connection.
I set the encoding to utf-8 both on the request and on the InputStreamReader, but I continue to get gobbledygook for some of the requests, while others work fine.
The following code reflects all the research I did and the advice I found. I have also tried changing URLConnection to HttpURLConnection, with no luck. Some of the returned strings continue to look like this:
??}?r?H????P?n?c??]?d?G?o??Xj{?x?"P$a?Qt?#&??e?a#?????lfVx)?='b?"Y(defUeefee=??????.??a8??{O??????zY?2?M???3c??@
What am I missing?
My code:
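import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public static String getDocumentFromUrl(String urlString) throws Exception {
    // A StringBuilder avoids the "null" prefix that `String s = null; s += ...` produces
    StringBuilder wholeDocument = new StringBuilder();
    URL url = new URL(urlString);
    URLConnection conn = url.openConnection();
    conn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
    conn.setRequestProperty("Accept-Charset", "utf-8");
    conn.setConnectTimeout(60 * 1000); // wait at most 60 seconds to connect
    conn.setReadTimeout(60 * 1000);    // wait at most 60 seconds for each read
    // try-with-resources closes the reader and the underlying stream
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), "utf-8"))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            wholeDocument.append(inputLine); // readLine() drops the line terminators
        }
    }
    return wholeDocument.toString();
}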
Upvotes: 1
Views: 5142
Reputation: 111359
The server is sending the document GZIP-compressed. You can set the Accept-Encoding HTTP header to ask it to send the document uncompressed:
conn.setRequestProperty("Accept-Encoding", "identity");
Even so, you normally shouldn't have to worry about details like this. What seems to be going on here is that the server is buggy: it does not send the Content-Encoding header to tell you the content is compressed. This behavior seems to depend on the User-Agent, so the site works in regular web browsers but breaks when used from Java. Setting the user agent therefore also fixes the issue:
conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // for example
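If you would rather not depend on the server honoring either header, a defensive option is to look at the first two bytes of the response yourself: a GZIP stream always starts with the magic bytes 0x1f 0x8b. The following is only a sketch of that idea, not something the site or the JDK prescribes. The class name GzipAwareFetcher and the helpers maybeGunzip and readAll are made up for illustration, and the body is assumed to be UTF-8 as in the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class GzipAwareFetcher {

    /** Returns a decompressing stream if the data starts with the GZIP magic bytes. */
    public static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(raw, 2);
        int b1 = pb.read();
        int b2 = pb.read();
        // Push the peeked bytes back in reverse order so b1 is read first again.
        if (b2 != -1) pb.unread(b2);
        if (b1 != -1) pb.unread(b1);
        boolean gzipped = b1 == 0x1f && b2 == 0x8b;
        return gzipped ? new GZIPInputStream(pb) : pb;
    }

    /** Reads the whole stream and decodes it as UTF-8 (an assumption, as in the question). */
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static String getDocumentFromUrl(String urlString) throws IOException {
        URLConnection conn = new URL(urlString).openConnection();
        conn.setRequestProperty("Accept-Encoding", "identity"); // still ask for uncompressed
        conn.setConnectTimeout(60 * 1000);
        conn.setReadTimeout(60 * 1000);
        // Decompress only if the body really is GZIP, whatever the headers claim.
        try (InputStream in = maybeGunzip(conn.getInputStream())) {
            return readAll(in);
        }
    }
}
```

Calling GzipAwareFetcher.getDocumentFromUrl(urlString) then returns readable text whether or not the server compressed the response. Unlike the readLine() loop in the question, reading raw bytes also keeps the original line breaks.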
Upvotes: 3