Reputation: 427
I am running Nutch 1.4 with Solr 4.10 to index a number of sites. My crawl starts from a number of seed pages containing several hundred links. I am currently running with
-topN 400 -depth 20
With these settings it takes 5-7 hours to complete the crawl. I would like each individual run of "nutch crawl" to take less time, but I need to be sure that every page is eventually crawled. Can I reduce either my -topN or -depth value and still be sure all pages will be crawled?
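For reference, the command I run is essentially the following (the seed directory, crawl directory and Solr URL are placeholders here rather than my real values):

    bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 20 -topN 400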
Upvotes: 0
Views: 145
Reputation: 4854
Changing depth (which should really have a different name: it is the number of iterations, which is often the same as the depth but not necessarily) won't make much of a difference, as the crawl stops iterating as soon as there are no more URLs to fetch. topN limits the total number of URLs per segment: a lower value means more iterations will be done, but on the whole it shouldn't change how long your crawl takes. To put rough numbers on it, if 4,000 URLs are due for fetching, -topN 400 gets through them in about 10 rounds while -topN 200 needs about 20, but it is the same 4,000 fetches either way.
There are many factors affecting the speed of a crawl (see the Nutch wiki), but it is mostly a matter of host diversity and politeness. I'd recommend running Nutch in pseudo-distributed mode and using the Hadoop UI to understand which steps take the time, and take it from there.
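For instance, assuming the standard 1.x layout, if you build the job file with ant and launch the same crawl from runtime/deploy instead of runtime/local, each step runs as a Hadoop job and the JobTracker web UI (http://localhost:50030 on a default Hadoop 1.x pseudo-distributed setup) will show where the time goes:

    runtime/deploy/bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 20 -topN 400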
PS: that's a very old version of Nutch. Maybe time to upgrade to a more recent one?
Upvotes: 0