Mark

Reputation: 41085

Simplest way to correctly load HTML from a web page into a String in Java

Just what the title says.

Help greatly appreciated!

Upvotes: 30

Views: 44087

Answers (3)

altumano

Reputation: 2735

You can still simplify it a bit using org.apache.commons.io.IOUtils:

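URL url = new URL("http://stackoverflow.com/questions/1381617");
URLConnection con = url.openConnection();
// Allow "text/html;charset=..." with or without whitespace, in any letter case.
Pattern p = Pattern.compile("text/html;\\s*charset=([^\\s;]+)\\s*", Pattern.CASE_INSENSITIVE);
// getContentType() returns null when the server sends no Content-Type header.
String contentType = con.getContentType();
Matcher m = p.matcher(contentType == null ? "" : contentType);
/* If Content-Type doesn't match this pre-conception, choose default and
 * hope for the best. */
String charset = m.matches() ? m.group(1) : "ISO-8859-1";
// IOUtils is org.apache.commons.io.IOUtils from Apache Commons IO.
String str = IOUtils.toString(con.getInputStream(), charset);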

Upvotes: 4

erickson

Reputation: 269647

An extremely common error is the failure to correctly convert an HTTP response from bytes to characters. To do this, you have to know the character encoding of the response. Hopefully, this is specified as the "charset" parameter of the "Content-Type" header. But putting it in the body itself, as an "http-equiv" attribute in a meta tag, is also an option.

So, it is surprisingly complicated to load a page into a String correctly, and even third-party libraries like HttpClient don't offer a general solution.

Here's a simple implementation that will handle the most common case:

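URL url = new URL("http://stackoverflow.com/questions/1381617");
URLConnection con = url.openConnection();
// Allow "text/html;charset=..." with or without whitespace, in any letter case.
Pattern p = Pattern.compile("text/html;\\s*charset=([^\\s;]+)\\s*", Pattern.CASE_INSENSITIVE);
// getContentType() returns null when the server sends no Content-Type header.
String contentType = con.getContentType();
Matcher m = p.matcher(contentType == null ? "" : contentType);
/* If Content-Type doesn't match this pre-conception, choose default and
 * hope for the best. */
String charset = m.matches() ? m.group(1) : "ISO-8859-1";
StringBuilder buf = new StringBuilder();
// try-with-resources closes the reader (and the underlying stream) when done.
try (Reader r = new InputStreamReader(con.getInputStream(), charset)) {
  while (true) {
    int ch = r.read();
    if (ch < 0)
      break;
    buf.append((char) ch);
  }
}
String str = buf.toString();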
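
The charset lookup above can also be pulled out into a small helper that is a bit more tolerant of real-world headers. This is only a sketch: the class name ContentTypeCharset and the method charsetOf are made up for illustration, and the regular expression is a simplification of the Content-Type grammar, not a full parser. It accepts a quoted value, ignores letter case, and falls back to a caller-supplied default when the header is missing or names a charset the JVM does not know.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ContentTypeCharset {
    // Finds a "charset" parameter anywhere in the header value; the value may be quoted.
    private static final Pattern CHARSET =
        Pattern.compile("(?i);\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?");

    // Returns the charset named in the header, or fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // no Content-Type header was sent
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback; // header has no charset parameter
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }
}
```

With this, the reader could be built as new InputStreamReader(con.getInputStream(), ContentTypeCharset.charsetOf(con.getContentType(), StandardCharsets.ISO_8859_1)). It still does not look at a meta tag in the body.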

Upvotes: 33

OscarRyz

Reputation: 199215

I use this:

        // urlToSearch holds the page address as a String.
        // No charset is given, so the bytes are decoded with the platform default.
        BufferedReader bufferedReader = new BufferedReader( 
                                     new InputStreamReader( 
                                          new URL(urlToSearch)
                                              .openConnection()
                                              .getInputStream() ));

        StringBuilder sb = new StringBuilder();
        try {
            String line = null;
            while( ( line = bufferedReader.readLine() ) != null ) {
                 sb.append( line ) ;
                 sb.append( "\n");
            }
        } finally {
            bufferedReader.close();
        }

It works most of the time.
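
The loop above decodes with the platform default charset and rewrites every line ending as "\n". A possible variant, sketched here under the assumption that the caller already knows the charset, reads in chunks with an explicit charset and lets try-with-resources do the closing. The names StreamText and readAll are invented for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of any library.
final class StreamText {
    // Decodes the whole stream with the given charset and closes it afterwards.
    static String readAll(InputStream in, Charset charset) throws IOException {
        try (Reader r = new InputStreamReader(in, charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[8192];
            int n;
            // read(char[]) returns -1 at end of stream.
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }
}
```

It could be called as StreamText.readAll(new URL(urlToSearch).openConnection().getInputStream(), StandardCharsets.UTF_8), where UTF-8 is only a guess unless the response says so.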

Upvotes: 1
