Reputation: 1241
I am writing a simple web crawler in Java. I want it to be able to download as many pages per second as possible. Is there a package out there that makes doing asynchronous HTTP requests easy in Java? I have used HttpURLConnection, but that is blocking. I also know there is something in Apache's HttpCore NIO, but I am looking for something more lightweight. I tried that package, and I was getting better throughput using HttpURLConnection on multiple threads.
Upvotes: 3
Views: 6685
Reputation: 27538
Generally, data-intensive protocols tend to perform better in terms of raw throughput with classic blocking I/O than with NIO, as long as the number of threads stays below 1000. At least that is certainly the case with client-side HTTP, based on the (likely imperfect and possibly biased) HTTP benchmark used by Apache HttpClient [1].
One may be much better off using a blocking HTTP client with threads, as long as the number of threads is moderate (<250).
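To illustrate the blocking-client-with-threads approach, here is a minimal sketch using a fixed thread pool and plain HttpURLConnection (assuming Java 9+ for List.of and readAllBytes; the pool size and URLs are placeholders):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class BlockingCrawler {

        public static void main(String[] args) throws Exception {
            List<String> urls = List.of(
                    "http://example.com/",
                    "http://example.org/");

            // A moderate, fixed-size pool; each worker issues plain blocking requests.
            ExecutorService pool = Executors.newFixedThreadPool(100);

            for (String url : urls) {
                pool.submit(() -> fetch(url));
            }
            pool.shutdown();
        }

        private static void fetch(String url) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(url).openConnection();
                conn.setConnectTimeout(5_000);
                conn.setReadTimeout(5_000);
                try (InputStream in = conn.getInputStream()) {
                    byte[] body = in.readAllBytes(); // blocks this worker thread only
                    System.out.println(url + " -> " + body.length + " bytes");
                }
            } catch (Exception e) {
                System.err.println(url + " failed: " + e);
            }
        }
    }

Each blocked read ties up only its own pool thread, which is cheap at these thread counts and keeps the code far simpler than an NIO event loop.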
If you are absolutely sure you want an NIO-based HTTP client, I can recommend the Jetty HTTP client, which I personally consider the best asynchronous HTTP client at the moment.
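For reference, here is a minimal sketch of non-blocking requests against the Jetty 9.x client API (org.eclipse.jetty.client); package names and listener types differ across Jetty versions, and the URLs are placeholders:

    import org.eclipse.jetty.client.HttpClient;
    import org.eclipse.jetty.client.api.Result;
    import org.eclipse.jetty.client.util.BufferingResponseListener;

    public class JettyCrawler {

        public static void main(String[] args) throws Exception {
            HttpClient client = new HttpClient();
            client.start();

            String[] urls = { "http://example.com/", "http://example.org/" };

            for (String url : urls) {
                // send() with a listener returns immediately; the response
                // is delivered later on one of Jetty's internal threads.
                client.newRequest(url).send(new BufferingResponseListener() {
                    @Override
                    public void onComplete(Result result) {
                        if (result.isSucceeded()) {
                            System.out.println(url + " -> "
                                    + getContent().length + " bytes");
                        } else {
                            System.err.println(url + " failed: "
                                    + result.getFailure());
                        }
                    }
                });
            }
            // A real crawler would wait for outstanding exchanges before
            // stopping the client.
        }
    }

All requests are issued up front and completions arrive asynchronously, so a handful of I/O threads can keep many connections in flight at once.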
[1] http://wiki.apache.org/HttpComponents/HttpClient3vsHttpClient4vsHttpCore
Upvotes: 6
Reputation: 7387
While that question isn't quite the same as yours, you may find its answers useful: Asynchronous HTTP Client for Java
As a side-note, if you're going to download "as many pages per second as possible", you should bear in mind that crawlers can inadvertently grind a weak server to a halt. You should probably read up on "robots.txt" and the appropriate way of interpreting this file before you unleash your creation on anything outside of your own personal test setup.
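For illustration, here is a deliberately naive robots.txt check (a hand-rolled sketch, not a spec-compliant parser; a real crawler should also honor per-agent groups, wildcards, and Crawl-delay, and the host name below is a placeholder):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class RobotsCheck {

        /** Fetches /robots.txt and collects Disallow rules for User-agent: *. */
        static List<String> disallowedPrefixes(String host) throws Exception {
            List<String> prefixes = new ArrayList<>();
            URL robots = new URL("http://" + host + "/robots.txt");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(robots.openStream()))) {
                boolean inStarGroup = false;
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.toLowerCase().startsWith("user-agent:")) {
                        inStarGroup = line.substring(11).trim().equals("*");
                    } else if (inStarGroup
                            && line.toLowerCase().startsWith("disallow:")) {
                        String path = line.substring(9).trim();
                        if (!path.isEmpty()) {
                            prefixes.add(path);
                        }
                    }
                }
            }
            return prefixes;
        }

        /** A URL path is allowed if no Disallow rule is a prefix of it. */
        static boolean allowed(String path, List<String> disallowed) {
            for (String prefix : disallowed) {
                if (path.startsWith(prefix)) {
                    return false;
                }
            }
            return true;
        }

        public static void main(String[] args) throws Exception {
            List<String> rules = disallowedPrefixes("example.com");
            System.out.println("/private allowed? " + allowed("/private", rules));
        }
    }

Checking this once per host (and caching the result) costs you almost nothing in throughput while keeping your crawler off paths site owners have asked you to avoid.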
Upvotes: 3