Reputation: 425
I've been working with Apache Nutch and Solr for a while to crawl and index some sites. Now there is a behaviour in Nutch I can't explain. There are two scenarios:
For the single seed I've included in both scenarios, I expect the same URLs to be crawled. In my opinion there should be no difference.
Anyway, I wouldn't be writing here if my opinion were right. In reality, the two scenarios produce different numbers of crawled URLs: there are more crawled URLs in the first scenario. To conclude, if I crawl a single seed, the crawl is broader than with a seed list containing a bundle of sites.
Is this behaviour standard or is it unusual? Is it possible that links from other seed points interrupt the process in a way that my analysed seed can't follow all of its links? Is it a configuration problem, or just a Nutch thing?
Upvotes: 0
Views: 139
Reputation: 2239
There are a couple of configuration properties and parameters which influence the way Nutch follows links. Your observation that adding more seeds (from different sites or hosts) causes a decrease in the number of crawled documents/pages per host can easily be explained by a limit on the number of pages fetched per round, set via the -topN parameter of the "generate" step. If the fetch list is limited to, e.g., 100 pages per round, those 100 slots are shared among all hosts in the crawl, so after the same number of rounds there are fewer pages fetched per site in the second scenario.
As a solution you could either increase -topN or the number of rounds (-depth).
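For illustration, a minimal sketch of where these parameters appear on the Nutch 1.x command line; the paths and numbers below are placeholders, not values taken from the question:

    # generate step: limit the next round's fetch list to at most 1000 URLs
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000

    # older all-in-one crawl command (removed in later Nutch releases):
    # 3 rounds (-depth), up to 1000 URLs per round (-topN)
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

With only one seed, the whole -topN budget goes to that site; with many seeds, the budget is split across hosts, which matches the observed difference.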
Upvotes: 3