Reputation: 28364
I'm writing a multi-threaded Java web crawler. From what I understand of the web, when a user loads a web page the browser requests the first document (e.g., index.html), and as it receives the HTML it finds other resources that need to be included (images, CSS, JS) and requests those resources concurrently.
My crawler only requests the original document, but for some reason I can't get it to scrape more than 2 to 5 pages every 5 seconds. I'm spinning up a new thread for every HttpURLConnection I make, roughly like the sketch below. It seems like I should be able to scrape at least 20-40 pages per second. If I try to spin up 100 threads, I get I/O exceptions like crazy. Any ideas what's going on?
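Here's a minimal sketch of what I'm doing (class and method names simplified; the real code also parses the HTML for further links):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class Fetcher implements Runnable {
    private final String pageUrl;

    public Fetcher(String pageUrl) {
        this.pageUrl = pageUrl;
    }

    @Override
    public void run() {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(pageUrl).openConnection();
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            StringBuilder html = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
            reader.close();
            // ... parse html for links and queue them ...
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

// elsewhere, one thread per URL:
// new Thread(new Fetcher(url)).start();
```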
Upvotes: 2
Views: 968
Reputation: 32335
It would be a good idea to look at your code, since you might have done something slightly wrong that breaks your crawler. But as a general rule of thumb, asynchronous IO is far superior to the blocking IO that HttpURLConnection offers: it lets you handle all of the processing in a single thread while the operating system performs the actual IO on its own time.
For a good implementation of the HTTP protocol over asynchronous IO, look at Apache HttpCore. See an example of such a client here.
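For illustration, here's a minimal sketch using Apache HttpAsyncClient, the client built on top of HttpCore NIO. The 4.x API and the URL list are assumptions; the point is that a few reactor threads serve all the requests:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.concurrent.FutureCallback;
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClients;

public class AsyncCrawl {
    public static void main(String[] args) throws Exception {
        String[] urls = { "http://example.com/", "http://example.org/" };
        CloseableHttpAsyncClient client = HttpAsyncClients.createDefault();
        client.start();
        final CountDownLatch latch = new CountDownLatch(urls.length);
        for (String url : urls) {
            client.execute(new HttpGet(url), new FutureCallback<HttpResponse>() {
                public void completed(HttpResponse response) {
                    // called by the IO reactor; no thread-per-request needed
                    System.out.println(response.getStatusLine());
                    latch.countDown();
                }
                public void failed(Exception ex) {
                    ex.printStackTrace();
                    latch.countDown();
                }
                public void cancelled() {
                    latch.countDown();
                }
            });
        }
        latch.await();  // wait for all requests to complete
        client.close();
    }
}
```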
Upvotes: 1
Reputation: 341
Oh, and I hope you're close()-ing the InputStreams you get from the connections. They get closed in the connection's finalizer anyway, but that may easily be seconds later, and until then the socket stays tied up. I ran into that issue myself, so maybe that helps you.
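For example, a quick sketch of closing the stream promptly with try-with-resources (Java 7+); the fetch method and pageUrl parameter are just placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

class Page {
    static String fetch(String pageUrl) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(pageUrl).openConnection();
        // try-with-resources closes the stream as soon as we're done,
        // rather than whenever the connection's finalizer happens to run
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            StringBuilder html = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
            return html.toString();
        }
    }
}
```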
Upvotes: 0
Reputation: 24447
The best count of threads (or concurrent HttpURLConnections) depends on many factors: your bandwidth, your CPU, how quickly the target servers respond, and how many open sockets your OS allows. Rather than spawning an unbounded number of threads, use a bounded pool so you can tune that number; see the sketch below.
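As a sketch, a fixed-size thread pool makes the concurrency explicit and tunable; the pool size of 20 and the URL list here are assumptions, not recommendations:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PooledCrawl {
    public static void main(String[] args) {
        String[] urls = { "http://example.com/", "http://example.org/" };
        // cap concurrency at a fixed number of workers; tune 20 up or down
        // based on bandwidth, CPU, and how the target servers behave
        ExecutorService pool = Executors.newFixedThreadPool(20);
        for (final String url : urls) {
            pool.submit(new Runnable() {
                public void run() {
                    // a fetch(url) call, e.g. the HttpURLConnection code above
                    System.out.println("fetching " + url);
                }
            });
        }
        pool.shutdown();  // stop accepting new work; queued tasks still finish
    }
}
```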
Upvotes: 0
Reputation: 8586
Details on what -kind- of IOExceptions you're receiving might be handy. There are a few possibilities to consider: with 100 threads you could be exhausting file descriptors (SocketException: too many open files), hitting connect timeouts, or getting reset by servers that throttle aggressive clients.
Upvotes: 0