Reputation: 543
I'm going to keep this question short and sweet. I have a function that takes a URL to read as a string and returns a string of the HTML source of a webpage. Here it is:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Returns the source code of a given URL.
public static String getHTML(String urlToRead) throws Exception
{
    StringBuilder result = new StringBuilder();
    URL url = new URL(urlToRead);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36");
    // The response body is decoded as UTF-8 regardless of what the server declares.
    try (BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8)))
    {
        String line;
        while ((line = rd.readLine()) != null)
        {
            result.append(line).append(System.lineSeparator());
        }
    }
    return result.toString();
}
It works like a charm, with the exception of one tiny quirk. Certain characters are not being read correctly by the InputStreamReader. The "ł" character isn't correctly read, and is instead replaced by a "?". That's the only character I've found thus far that follows this behaviour but there's no telling what other characters aren't being read correctly.
It seems like an issue with the character set. I'm using UTF-8 as you can see from the code. All other character sets I've tried using in its place have either outright not worked or have had trouble with far more than just one character.
What kind of thing could be causing this issue? Any help would be greatly appreciated!
Upvotes: 0
Views: 973
Reputation: 8397
You should decode the response with the same charset the resource is encoded in. First find out which encoding that HTML page actually uses. Usually the server sends it in the Content-Type response header. Since this is a plain GET request, you can easily check it in any web browser that has network tracking.
For example, in Chrome: open an empty tab, open dev tools (F12), and load the desired web page. Then look at the Network tab in dev tools and examine the response headers.
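If you would rather have the code pick up the charset than look it up by hand, here is a minimal sketch of reading it from the Content-Type header value. The class name CharsetSniffer and the method charsetFromContentType are made up for this example, and falling back to UTF-8 when the header has no usable charset is my assumption, not something HTTP guarantees. Pages with "ł" that are not UTF-8 are often served as ISO-8859-2 or windows-1250, but check what your server actually reports.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question's code.
final class CharsetSniffer {

    // Extracts the charset parameter from a Content-Type header value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is missing, has no charset, or names an unknown charset.
    static Charset charsetFromContentType(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1")); // ISO-8859-1
        System.out.println(charsetFromContentType("text/html"));                     // UTF-8
    }
}
```

In getHTML you could then pass CharsetSniffer.charsetFromContentType(conn.getContentType()) to the InputStreamReader instead of a hard-coded UTF-8. Note that this only looks at the HTTP header; a page that declares its encoding only in a meta tag would still need separate handling.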
Upvotes: 0
Reputation: 405
Have you tried:
conn.setRequestProperty("content-type", "text/plain; charset=utf-8");
Upvotes: 1