user3839319
Reputation: 45

Incremental crawling in Nutch

I'm new to Nutch and am doing a POC with Nutch 1.9. I am only trying to crawl my own site to set up a search on it. I find that the first crawl I do only crawls one page. The second crawls 40 pages, the third 300. After that the increments get smaller, and it crawls around 400 pages overall. Does anyone know why it doesn't just do the full crawl of the website on the first run? I used the Nutch tutorial (http://wiki.apache.org/nutch/NutchTutorial) and am running the crawl script as per section 3.5.

I'm also finding that even with multiple runs it doesn't crawl the whole site: GSA brings back over 900 pages for the same site, while Nutch brings back 400.

Thanks kindly

Jason

Upvotes: 0

Views: 1371

Answers (2)

Julien Nioche

Reputation: 4854

Why don't you use the Nutch mailing list? You'd get a larger audience and quicker answers from fellow Nutch users.

What value are you setting for the number of rounds when using the crawl script? Setting it to 1 means that you won't go further than the URLs in the seed list. Use a large value to crawl the whole site in a single call to the script.

The difference in the total number of URLs could be down to the max outlinks per page parameter, as Kumar suggested, but it could also be due to the URL filtering.
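To make those two effects concrete, here is a rough Python sketch. It is not Nutch code: `reachable`, `max_outlinks`, `skip_pattern` and the toy `site` are all made up for illustration. It only shows how a per-page outlink cap and a URL filter each shrink the set of pages a crawler can ever discover. In Nutch itself the cap is, as far as I recall, the `db.max.outlinks.per.page` property, and the default `regex-urlfilter.txt` skips URLs containing characters such as `?` and `=`, so check both in your own setup.

```python
import re

def reachable(link_graph, seed, max_outlinks=None, skip_pattern=None):
    """Count pages reachable from seed under an outlink cap and a URL filter."""
    skip = re.compile(skip_pattern) if skip_pattern else None
    seen = {seed}
    frontier = [seed]
    while frontier:
        url = frontier.pop()
        links = link_graph.get(url, [])
        if max_outlinks is not None:
            links = links[:max_outlinks]  # links past the cap are dropped
        for link in links:
            if skip and skip.search(link):
                continue  # filtered URLs are never recorded
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return len(seen)

# Hypothetical site: a home page with three articles and a paginated list.
site = {
    "/": ["/p1", "/p2", "/p3", "/list?page=2"],
    "/list?page=2": ["/p4", "/p5"],
}

print(reachable(site, "/"))                            # 7
print(reachable(site, "/", max_outlinks=2))            # 3
print(reachable(site, "/", skip_pattern=r"[?*!@=]"))   # 4
```

In the last call the paginated list page is filtered out, so the two pages that are only linked from it are never found. That kind of loss could explain a gap like 400 versus 900 pages.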

Upvotes: 0

Kumar

Reputation: 3990

As far as I know, Nutch crawls the links it already knows about, collects the inlinks and outlinks from those pages, and then adds those links to the db for the next crawl. That is why Nutch didn't crawl all the pages in a single run.
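A toy simulation may make the round-by-round growth easier to see. This is only a sketch in Python, not how Nutch is implemented: `crawl_rounds` and `toy_site` are hypothetical names, and the "db" is just a set. Each round fetches the URLs that are known but not yet fetched, then records their outlinks for the next round.

```python
def crawl_rounds(link_graph, seeds, rounds):
    """Simulate breadth-first crawl rounds over a toy link graph.

    Returns a list with the number of pages fetched in each round.
    """
    known = set(seeds)   # URLs recorded in the toy crawl db
    fetched = set()      # URLs already fetched
    per_round = []
    for _ in range(rounds):
        # Generate: select known URLs that have not been fetched yet.
        batch = sorted(known - fetched)
        if not batch:
            break  # nothing new to fetch, the site is exhausted
        # Fetch and update: visit each page and record its outlinks.
        for url in batch:
            fetched.add(url)
            known.update(link_graph.get(url, []))
        per_round.append(len(batch))
    return per_round

# Hypothetical site: a home page linking to two sections with leaf pages.
toy_site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2", "/"],
    "/b": ["/b/1", "/a"],
}

print(crawl_rounds(toy_site, ["/"], 1))    # [1]
print(crawl_rounds(toy_site, ["/"], 10))   # [1, 2, 3]
```

With one round only the seed page is fetched, which matches the "first crawl only crawls one page" behaviour in the question. With enough rounds the crawl keeps going until no unfetched URLs remain.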

Incremental crawling means crawling only new or updated pages and leaving unmodified pages alone.

Nutch crawls only a limited number of pages because of your configuration settings. Change them to crawl all pages. See here

If you want to set up search for a single website, take a look at Aperture. It will crawl the whole website in a single run, and it supports incremental crawling.

Upvotes: 1
