Reputation: 14
I have a program that downloads webpages and processes the body, and I am having trouble detecting the encoding for some pages, especially when no encoding information is given in the headers or in the HTML content. Is there a way in Java to auto-detect the character encoding of a String or of the HTML body of a response?
Upvotes: 0
Views: 917
Reputation: 1112
As an alternative answer I would suggest:
URLConnection.guessContentTypeFromStream(InputStream is)
(but the stream must support marking), and
URLConnection.guessContentTypeFromName(String fname)
(yes, I know it sounds silly, but it is very efficient).
Of course, you first have to get the stream for the body of the HttpURLConnection, something like this: InputStream is = connection.getInputStream();
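To make that concrete, here is a minimal sketch of how the two calls fit together (the URL and file name are just placeholders). Wrapping the body in a BufferedInputStream provides the mark/reset support that guessContentTypeFromStream requires; note that both methods return a MIME type guess (or null), not a charset on its own.

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

public class ContentTypeGuess {
    public static void main(String[] args) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL("https://example.com/").openConnection();

        // BufferedInputStream supports mark/reset, which
        // guessContentTypeFromStream() needs to peek at the first bytes.
        try (InputStream is = new BufferedInputStream(connection.getInputStream())) {
            String fromStream = URLConnection.guessContentTypeFromStream(is);
            String fromName = URLConnection.guessContentTypeFromName("index.html");

            System.out.println("Guessed from stream: " + fromStream); // may be null
            System.out.println("Guessed from name:   " + fromName);   // e.g. text/html
        }
    }
}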
Upvotes: 0
Reputation: 1560
Have a look at juniversalchardet, which is the Java port of Mozilla's encoding detector library.
Here is a sample program to check if the encoding is UTF-8.
import org.mozilla.universalchardet.UniversalDetector;

protected static boolean validUTF8(byte[] input) {
    // Feed the whole buffer to the detector, then signal end of data.
    UniversalDetector detector = new UniversalDetector(null);
    detector.handleData(input, 0, input.length);
    detector.dataEnd();

    // getDetectedCharset() returns null if no charset could be determined.
    return "UTF-8".equals(detector.getDetectedCharset());
}
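If you want the detected charset itself rather than a yes/no check, the same detector can report it. Here is a minimal sketch, assuming juniversalchardet is on the classpath; the UTF-8 fallback and the sample bytes are my own choices, and detection may return null for short or ambiguous input.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.mozilla.universalchardet.UniversalDetector;

public class CharsetSniffer {

    // Detects the charset of the given bytes, falling back to UTF-8 if unknown.
    public static Charset detectCharset(byte[] body) {
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(body, 0, body.length);
        detector.dataEnd();

        String name = detector.getDetectedCharset(); // null if detection failed
        detector.reset();                            // allows the detector to be reused

        return (name != null && Charset.isSupported(name))
                ? Charset.forName(name)
                : StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] body = "héllo wörld".getBytes(StandardCharsets.ISO_8859_1);
        Charset charset = detectCharset(body);
        System.out.println("Detected: " + charset);
        System.out.println("Decoded:  " + new String(body, charset));
    }
}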
Upvotes: 1