Reputation: 777
I have an issue where I try to issue a new crawl on something I've already crawled, but with some new URLs.
So first I have
urls/urls.txt -> www.somewebsite.com
I then issue the command
bin/nutch crawl urls -dir crawl -depth 60 -threads 50
I then update urls/urls.txt -> remove www.somewebsite.com -> add www.anotherwebsite.com
I issue the commands
bin/nutch inject crawl urls
bin/nutch crawl urls -dir crawl -depth 60 -threads 50
What I would expect here is that www.anotherwebsite.com is injected into the existing 'crawl' db, and when crawl is issued again it should only crawl the new website I've added, www.anotherwebsite.com (as the refetch interval for the original is set to 30 days).
What I have experienced is that either
1.) no website is crawled
2.) only the original website is crawled
'Sometimes', if I leave it for a few hours, it starts working and picks up the new website, crawling both the old website and the new one (even though the refetch time is set to 30 days).
It's very weird and unpredictable behaviour.
I'm pretty sure my regex-urlfilter.txt file is set up correctly, and my nutch-site.xml / nutch-default.xml are set up with defaults (near enough).
Questions:
Can anyone explain simply (with commands) what is happening during each crawl, and how to update an existing crawl db with some new URLs?
Can anyone explain (with commands) how I force a recrawl of 'all' URLs in the crawl db? I have issued a readdb and checked the refetch times, and most are set to a month, but what if I want to refetch sooner?
Upvotes: 1
Views: 1494
Reputation: 1491
The article linked here explains the crawl process in sufficient depth.
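For completeness, here is a rough sketch of the individual steps that the all-in-one bin/nutch crawl command runs for you in Nutch 1.x, which is also how you add new URLs to an existing crawl db by hand. The paths crawl/crawldb, crawl/segments and crawl/linkdb are what -dir crawl produces, and values like -topN 1000 and -threads 50 are illustrative; note that inject expects the path of the crawl db itself, not the top-level crawl directory:

# 1. Inject the seeds from urls/urls.txt into the crawl db (run this again whenever urls.txt changes)
bin/nutch inject crawl/crawldb urls

# 2. Generate a fetch list (a new segment) from URLs whose fetch time is due
bin/nutch generate crawl/crawldb crawl/segments -topN 1000

# 3. Pick up the segment that was just created
segment=$(ls -d crawl/segments/* | tail -1)

# 4. Fetch the segment, then parse it (skip the parse step if fetcher.parse is true in your config)
bin/nutch fetch "$segment" -threads 50
bin/nutch parse "$segment"

# 5. Update the crawl db with fetch statuses, newly discovered links and next fetch times
bin/nutch updatedb crawl/crawldb "$segment"

# 6. Rebuild the link db
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

If step 2 finds no URLs whose fetch time has passed, no segment is generated and nothing gets fetched, which is what an "empty" crawl run looks like. To force a recrawl before the stored interval expires, one option (again a sketch, assuming Nutch 1.x; 31 is an illustrative value) is to let the generator pretend the clock has moved forward with -adddays, so entries on the default 30-day interval become eligible immediately. The interval itself is the db.fetch.interval.default property (2592000 seconds = 30 days) in nutch-default.xml, which you can override in nutch-site.xml:

# Inspect the crawl db and the per-URL refetch times
bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# Generate as if 31 days had passed, then fetch/parse/updatedb the new segment as above
bin/nutch generate crawl/crawldb crawl/segments -adddays 31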
Upvotes: 3