Reputation: 425
I've been working with Apache Nutch and Solr for a while to crawl and index some sites. Now there is a behaviour in Nutch I can't explain. There are two scenarios:
For the single seed I've included in both scenarios, I expect the same URLs to be crawled. In my opinion there should be no difference.
Anyway, I wouldn't be writing here if my opinion were right. In reality, the two scenarios produce different numbers of crawled URLs: there are more crawled URLs in the first scenario. To conclude, if I crawl a single seed, the crawl is broader than with a seed list containing a bundle of sites.
Is this behaviour standard or is it unusual? Is it possible that links from other seed points interrupt the process in a way that my analysed seed can't follow all of its links? Is it a configuration problem, or just a Nutch thing?
Upvotes: 0
Views: 139
Reputation: 2239
There are a couple of configuration properties and parameters which influence the way Nutch follows links. Your observation that adding more seeds (from different sites or hosts) causes a decrease in the number of crawled documents/pages per host can easily be explained by a limit on the number of pages fetched per round, set via the -topN parameter of the "generate" step. If the fetch list is limited to, e.g., 100 pages per round, those 100 slots are shared among all hosts in the crawl, so after the same number of rounds there are fewer pages fetched per site in the second scenario.
As a solution you could either increase -topN or the number of rounds (-depth).
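For illustration, a minimal sketch of where these parameters appear on the Nutch 1.x command line; the paths and numbers below are placeholders, not values taken from the question:

    # generate step: limit the next round's fetch list to at most 1000 URLs
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000

    # older all-in-one crawl command (removed in later Nutch releases):
    # 3 rounds (-depth), up to 1000 URLs per round (-topN)
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

With only one seed, the whole -topN budget goes to that site; with many seeds, the budget is split across hosts, which matches the observed difference.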
Upvotes: 3