Reputation: 69342
I have the following Java code to fetch the entire contents of an HTML page at a given URL. Can this be done in a more efficient way? Any improvements are welcome.
public static String getHTML(final String url) throws IOException {
if (url == null || url.length() == 0) {
throw new IllegalArgumentException("url cannot be null or empty");
}
final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
final BufferedReader buf = new BufferedReader(new InputStreamReader(conn.getInputStream()));
final StringBuilder page = new StringBuilder();
final String lineEnd = System.getProperty("line.separator");
String line;
try {
while (true) {
line = buf.readLine();
if (line == null) {
break;
}
page.append(line).append(lineEnd);
}
} finally {
buf.close();
}
return page.toString();
}
I can't help but feel that the line reading is less than optimal. I know that I'm possibly masking a MalformedURLException
caused by the openConnection
call, and I'm okay with that.
My function also has the side-effect of making the HTML String have the correct line terminators for the current system. This isn't a requirement.
I realize that network IO will probably dwarf the time it takes to read in the HTML, but I'd still like to know this is optimal.
On a side note: It would be awesome if StringBuilder
had a constructor for an open InputStream
that would simply take all the contents of the InputStream
and read it into the StringBuilder
.
Upvotes: 2
Views: 943
Reputation: 10948
As seen in the other answers, there are many different edge cases (HTTP peculiarities, encoding, chunking, etc) that should be accounted for in any robust solution. Therefore I propose that in anything other than a toy program you use the de facto Java standard HTTP library: Apache HTTP Components HTTP Client.
They provide many samples, "just" getting the response contents for a request looks like this:
HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet("http://www.google.com/");
ResponseHandler<String> responseHandler = new BasicResponseHandler();
String responseBody = httpclient.execute(httpget, responseHandler);
// responseBody now contains the contents of the page
System.out.println(responseBody);
httpclient.getConnectionManager().shutdown();
Upvotes: 10
Reputation: 24497
OK, edited once more. Be sure to put your try-finally blocks around it, or catch IOException
...
final static int BUFZ = 4096;
StringBuilder page = new StringBuilder();
HttpURLConnection conn =
(HttpURLConnection) new URL(url).openConnection();
InputStream is = conn.getInputStream()
// perhaps allocate this one time and reuse if you
//call this method a lot.
byte[] buf = new byte[BUFZ] ;
int nRead = 0;
while((nRead = is.read(buf, 0, BUFZ) > 0) {
page.append(new String(buf /* , Charset charset */));
// uses local default char encoding for now
}
Here try this one:
...
final static int MAX_SIZE = 10000000;
HttpURLConnection conn =
(HttpURLConnection) new URL(url).openConnection();
InputStream is = conn.getInputStream()
// perhaps allocate this one time and reuse if you
//call this method a lot.
byte[] buf = new byte[MAX_SIZE] ;
int nRead = 0;
int total = 0;
// you could also use ArrayList so that you could dynamically
// resize or there are other ways to resize an array also
while(total < MAX_SIZE && (nRead = is.read(buf) > 0) {
total += nRead;
}
...
// do something with buf array of length total
OK the code below was not working for you because the Content-length header line is not being sent in the beginning due to HTTP/1.1 "chunking"
...
HttpURLConnection conn =
(HttpURLConnection) new URL(url).openConnection();
InputStream is = conn.getInputStream()
int cLen = conn.getContentLength() ;
byte[] buf = new byte[cLen] ;
int nRead=0 ;
while(nRead < cLen) {
nRead += is.read(buf, nRead, cLen - nRead) ;
}
...
// do something with buf array
Upvotes: 2
Reputation: 56762
You could do your own buffering on top of InputStreamReader by reading bigger chunks into a character array and appending the array contents to the StringBuilder.
But it would make your code slightly harder to understand, and I doubt it would be worth it.
Note that the proposal by Sean A.O. Harney reads raw bytes, so you'd need to do the conversion to text on top of that.
Upvotes: 1