Enno Shioji

Reputation: 26882

Crawling redirects later with Nutch

The nutch-default.xml file suggests that redirect destinations can be recorded during the first crawl and fetched in a later crawl by setting http.redirect.max to 0.
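
For reference, a minimal sketch of the setting in question, assuming the override is placed in conf/nutch-site.xml (the usual place to override nutch-default.xml values). Only the property name and the value 0 come from the setup described here; the surrounding file layout is an assumption.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- 0: do not follow redirects during the fetch; record them for a later crawl -->
    <value>0</value>
  </property>
</configuration>
```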

The first crawl finished successfully, and we could see the redirect responses stored in the segments. We then updated the crawl DB, expecting the redirect destinations to be added to the next fetch list, but they were not included: the fetch list was mostly empty, containing only a few URLs that Nutch had failed to fetch on the first crawl.

Is there a parameter or config setting we need to pass during parsing, updating, or generating?

Upvotes: 0

Views: 310

Answers (1)

Tejas Patil

Reputation: 6169

The topN parameter must be increased so that all the URLs are picked up in the fetch list. The selection of URLs in the second round is based on their scores, and I don't think that can be modified.

Upvotes: 1
