GoetzOnline

Reputation: 427

With nutch crawl, if I use smaller values for -topN and -depth, will it still crawl all the same pages?

I am running Nutch 1.4/Solr 4.10 to index a number of sites. My crawl includes a number of seed pages with several hundred links. I am currently running with

-topN 400 -depth 20

With these settings it takes 5-7 hours to complete the crawl. I would like to have each individual iteration of "nutch crawl" take less time, but I need to ensure all pages are crawled eventually. Can I reduce either my -topN or -depth values and still be sure all pages will be crawled?
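For reference, the full invocation I'm running looks roughly like this (directory names and the Solr URL are placeholders for my actual setup):

bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 20 -topN 400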

Upvotes: 0

Views: 145

Answers (1)

Julien Nioche

Reputation: 4854

Changing depth (it should really have a different name: it's the number of iterations, which is often but not necessarily the same as the depth) won't make much of a difference, as the crawl stops iterating as soon as there are no more URLs to fetch. The topN value limits the number of URLs per segment: with a lower value more iterations will be needed, but as a whole it shouldn't affect how long your crawl takes.
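To make that concrete, the crawl command is roughly equivalent to looping over the individual Nutch steps, where -topN caps the size of each generated segment. A sketch of one iteration (paths are illustrative):

# one iteration: generate a segment capped at topN URLs, fetch it, parse it, update the crawldb
bin/nutch generate crawl/crawldb crawl/segments -topN 400
SEGMENT=crawl/segments/`ls -t crawl/segments | head -1`
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
# a lower -topN just means more of these iterations, each fetching fewer URLs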

There are many factors affecting the speed of a crawl (see the Nutch wiki), but it's mostly a matter of host diversity and politeness. I'd recommend running Nutch in pseudo-distributed mode and using the Hadoop UI to understand which steps take the most time, and take it from there.
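A rough sketch of what that could look like, assuming a Hadoop 1.x install already configured for pseudo-distributed mode (the job file name, class name and UI port are typical for the Nutch 1.x / Hadoop 1.x era):

# start HDFS and MapReduce locally, then submit the crawl as a Hadoop job
start-dfs.sh
start-mapred.sh
hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.4.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 20 -topN 400
# the JobTracker UI (usually http://localhost:50030) then shows how long each phase takes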

PS: that's a very old version of Nutch. Maybe time to upgrade to a more recent one?

Upvotes: 0
