Reputation: 26882
The nutch-default.xml suggests that there is a way to save redirect destinations on the first crawl and fetch them on the next crawl by setting http.redirect.max to 0.
The first crawl finished successfully, and we could see the redirect responses stored in the segments. We then updated the crawl DB so that the redirect destinations would be added to the next fetch list, but they were not included: the fetch list was mostly empty, with just a few URLs that Nutch had failed to fetch on the first crawl.
Is there a parameter or config setting we need to pass during the parse, updatedb, or generate step?
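For reference, the setting in question would look roughly like this when overridden in conf/nutch-site.xml. This is only a sketch of the property described above; the comment paraphrases the behaviour as I understand it from nutch-default.xml, and the rest of the configuration is assumed to be left at its defaults.

```xml
<!-- conf/nutch-site.xml: overrides the value shipped in nutch-default.xml -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- 0: the fetcher does not follow redirects immediately;
         the redirect targets are recorded for a later fetch round -->
    <value>0</value>
  </property>
</configuration>
```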
Upvotes: 0
Views: 310
Reputation: 6169
The topN parameter must be increased so that all the URLs are picked up in the fetch list. The selection of URLs in the second round is based on their scores, and I don't think that can be modified.
Upvotes: 1