Wissam Al-Wakeel

Reputation: 14

Java: auto-detect the encoding of an HTTP response body

I have a program that downloads web pages and processes the body, and I am having trouble detecting the encoding of some pages, especially when no charset is given in the response headers or in the HTML content. Is there a way in Java to automatically detect the character encoding of a String or of the HTML body of a response?

Upvotes: 0

Views: 917

Answers (2)

tur11ng

Reputation: 1112

As an alternative, I would suggest URLConnection.guessContentTypeFromStream(InputStream is), although the stream must support marking, and URLConnection.guessContentTypeFromName(String fname) (yes, I know it sounds silly, but it is very efficient). Note that both return a guessed MIME content type such as text/html, not a character encoding.

Of course, you first have to get the stream for the body of the HttpURLConnection, somewhat like this: InputStream is = response.getInputStream();
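To make the mark-support requirement concrete, here is a minimal sketch using only the JDK. The class name ContentTypeGuess and the method guessType are made up for illustration, and a ByteArrayInputStream stands in for the real response stream. The call inspects only the first few bytes and yields a MIME type guess (or null), so it does not by itself answer the charset question.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class ContentTypeGuess {

    // BufferedInputStream supports mark/reset, which guessContentTypeFromStream needs.
    // The stream is reset afterwards, so the body can still be read from the start.
    static String guessType(BufferedInputStream in) throws IOException {
        return URLConnection.guessContentTypeFromStream(in);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for connection.getInputStream() (assumption for the demo).
        InputStream raw = new ByteArrayInputStream(
                "<html><body>hi</body></html>".getBytes(StandardCharsets.US_ASCII));
        BufferedInputStream body = new BufferedInputStream(raw);

        // May print null if the leading bytes match no known signature.
        System.out.println(guessType(body));
        // Keep reading from 'body' (not 'raw') after the guess.
    }
}
```

Wrapping the stream once and reading from the wrapper afterwards avoids losing the bytes that were buffered during the guess.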

Upvotes: 0

Dhruvan Ganesh

Reputation: 1560

Have a look at juniversalchardet, a Java port of Mozilla's encoding detection library.

Here is a sample method that checks whether the detected encoding is UTF-8.

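import org.mozilla.universalchardet.UniversalDetector;

// Returns true if juniversalchardet detects the input as UTF-8.
protected static boolean validUTF8(byte[] input) {
  UniversalDetector detector = new UniversalDetector(null);
  detector.handleData(input, 0, input.length); // feed the whole buffer to the detector
  detector.dataEnd();                          // signal that no more data follows
  // getDetectedCharset() returns null when no encoding could be determined
  return "UTF-8".equals(detector.getDetectedCharset());
}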
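If adding a dependency is not an option, a narrower check is possible with the JDK alone. The sketch below is a different technique from the detector above: it does not guess an encoding, it only tests whether the bytes decode as well-formed UTF-8 by using a strict CharsetDecoder. The class name StrictUtf8Check and the method isWellFormedUtf8 are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of juniversalchardet.
final class StrictUtf8Check {

    // Returns true if the bytes are well-formed UTF-8, false otherwise.
    static boolean isWellFormedUtf8(byte[] input) {
        // REPORT makes the decoder throw instead of silently inserting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Pure ASCII input also passes this check, since ASCII is a subset of UTF-8, so a true result means "can be read as UTF-8" rather than "was definitely written as UTF-8".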

Upvotes: 1
