Incremental crawling in Nutch

Question

I'm new to Nutch and am doing a POC with Nutch 1.9. I am only trying to crawl my own site to set up a search on it. I find that the first crawl I do only crawls one page. The second crawls 40 pages, the third 300. The increments then shrink, and it crawls around 400 pages overall. Does anyone know why it doesn't just do a full crawl of the website on the first run? I used the Nutch tutorial (http://wiki.apache.org/nutch/NutchTutorial) and am running the crawl script as per section 3.5.

I'm also finding that even with multiple runs it doesn't crawl the whole site: GSA brings back over 900 pages for the same site, while Nutch brings back 400.

Thanks kindly

Jason

Julien Nioche · Accepted Answer

Why don't you use the Nutch mailing list? You'd get a larger audience and quicker answers from fellow Nutch users.

What value are you setting for the number of rounds when using the crawl script? Setting it to 1 means that you won't go further than the URLs in the seed list. Use a large value to crawl the whole site in a single call to the script.
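To illustrate why the page counts grow from run to run, here is a toy model of a round-based, breadth-first crawl. It is not Nutch code: the `crawl_rounds` function and the `SITE` link graph are made up for the example, and it ignores per-round fetch limits, politeness delays and re-fetching. Each round fetches only the URLs already known, and the links found on those pages become the fetch list for the next round.

```python
from typing import Dict, List, Set


def crawl_rounds(links: Dict[str, List[str]], seeds: List[str], rounds: int) -> List[int]:
    """Return the number of pages fetched in each round of a breadth-first crawl."""
    known: Set[str] = set(seeds)        # every URL discovered so far
    frontier: List[str] = list(seeds)   # URLs to fetch in the current round
    fetched_per_round: List[int] = []

    for _ in range(rounds):
        if not frontier:
            break  # nothing new was discovered, so the crawl is complete
        fetched_per_round.append(len(frontier))

        next_frontier: List[str] = []
        for url in frontier:
            for out in links.get(url, []):
                if out not in known:    # only newly discovered URLs are queued
                    known.add(out)
                    next_frontier.append(out)
        frontier = next_frontier

    return fetched_per_round


# Hypothetical site: page -> outlinks found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2", "/b"],
    "/b": ["/b1"],
    "/a1": ["/"],
}

print(crawl_rounds(SITE, ["/"], 1))    # one round: only the seed page
print(crawl_rounds(SITE, ["/"], 10))   # enough rounds to reach every page
```

With one round only the seed is fetched; with a generous round count the loop stops by itself once no new URLs turn up, which is why a large value is harmless.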

The difference in the total number of URLs could be due to the max outlinks per page parameter, as Kumar suggested, but it could also be due to URL filtering.
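A rough sketch of how those two settings can shrink the set of discovered pages. Again this is a toy model, not Nutch's implementation: `keep_outlinks` and the sample URLs are invented, and the order in which the cap and the filter are applied is an assumption. In Nutch itself, as far as I know, the cap is the `db.max.outlinks.per.page` property and the filter rules live in `conf/regex-urlfilter.txt`, whose default rules skip URLs containing characters such as `?` and `=`, so check both against your own configuration.

```python
import re
from typing import List


def keep_outlinks(outlinks: List[str], max_outlinks: int, reject: re.Pattern) -> List[str]:
    """Apply a per-page outlink cap, then drop URLs matching a reject pattern."""
    capped = outlinks[:max_outlinks]  # links beyond the cap are never seen
    return [url for url in capped if not reject.search(url)]


# Hypothetical outlinks found on a single page.
outlinks = [f"/page{i}" for i in range(5)] + ["/search?q=x", "/page5"]

# Reject pattern modelled on a rule that skips query-style URLs (an assumption here).
reject = re.compile(r"[?*!@=]")

kept = keep_outlinks(outlinks, max_outlinks=6, reject=reject)
print(kept)  # "/search?q=x" is filtered out, "/page5" is lost to the cap
```

If a lot of your site's pages are only reachable through URLs with query strings, or through index pages with more links than the cap, that alone could account for a gap like 400 versus 900.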
