Reputation: 13
I am trying to crawl a website using Nutch.
I noticed that Nutch re-fetches already fetched URLs on each loop iteration.
Config I have made:
Added config to nutch-site.xml:
I use commands:
I have tried Nutch 2.2.1 with MySQL and Nutch 2.3 with MongoDB. The result is the same: already fetched URLs are re-fetched on each crawl loop iteration.
What should I do so that only URLs that have not been crawled yet are fetched?
Upvotes: 0
Views: 442
Reputation: 11
This is an open issue for Nutch 2.X. I faced it this weekend too.
The fix is scheduled for release 2.3.1: https://issues.apache.org/jira/browse/NUTCH-1922.
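If you want to confirm that you are seeing the same symptom before the fix lands, something like the rough sketch below can count how often each URL shows up in the fetcher log. It assumes each fetch attempt writes a log line containing `fetching <url>`; that pattern and the helper name `refetched_urls` are my assumptions, not something taken from the Nutch documentation, so adjust the regex to whatever your log actually contains.

```python
import re
from collections import Counter

# Assumed log format: every fetch attempt produces a line that contains
# "fetching <url>". This is an assumption; adapt the pattern to your log.
FETCH_LINE = re.compile(r"fetching\s+(\S+)")


def refetched_urls(log_lines):
    """Return {url: count} for URLs that occur in more than one fetch line."""
    counts = Counter()
    for line in log_lines:
        match = FETCH_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    # Keep only URLs that were fetched at least twice.
    return {url: n for url, n in counts.items() if n > 1}
```

Pass it any iterable of log lines, for example an open file handle for your crawl log. A non-empty result after several generate/fetch/update rounds would be consistent with the behaviour described in the question.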
Upvotes: 1