Reputation: 21
I have many sites; the content of some changes every month, and the content of others changes every day. Nutch 1.3 has already crawled them, and now I want to recrawl them on different schedules. How can I do that? Thanks.
Upvotes: 0
Views: 488
Reputation: 807
You can specify the fetch interval (the time between two consecutive crawls, in seconds) for each entry in your seed file, with a tab (written \t here) between the URL and the metadata, like this:
http://daily.com \t nutch.fetchInterval=86400
http://montly.com \t nutch.fetchInterval=2592000
If you are using AdaptiveFetchSchedule, the entries above only set the starting interval; after each recrawl the interval is increased or decreased depending on whether the page has changed. In that case, if you always want a fixed interval, use nutch.fetchInterval.fixed instead of nutch.fetchInterval in the lines above.
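If you have a lot of sites, typing the intervals in seconds by hand gets error-prone, so here is a rough sketch of generating the seed lines instead. The helper names (seed_line, write_seed_file) and the idea of giving the interval in days are my own assumptions, not anything Nutch provides; only the two metadata keys come from the answer above.

```python
# Sketch: build Nutch seed-file lines with a per-URL fetch interval.
# seed_line and write_seed_file are hypothetical helpers, not part of Nutch.

SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def seed_line(url, interval_days, fixed=False):
    """Return one seed entry: the URL, a tab, then the interval in seconds."""
    # Use the ".fixed" key when the interval must not be adapted after recrawls.
    key = "nutch.fetchInterval.fixed" if fixed else "nutch.fetchInterval"
    return f"{url}\t{key}={interval_days * SECONDS_PER_DAY}"


def write_seed_file(path, entries):
    """Write (url, interval_days) pairs to a seed file, one entry per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for url, days in entries:
            fh.write(seed_line(url, days) + "\n")


if __name__ == "__main__":
    # Same two example sites as above: one daily, one every 30 days.
    print(seed_line("http://daily.com", 1))
    print(seed_line("http://montly.com", 30))
```

Pass fixed=True to seed_line if you want the nutch.fetchInterval.fixed form for a site.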
Upvotes: 1
Reputation: 243
You can write a shell script containing the commands you use to run the crawler, and use cron on Linux to schedule the execution of this script.
http://www.thegeekstuff.com/2011/07/cron-every-5-minutes/
Even Google recrawls the whole web repeatedly at some interval.
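To make the cron idea concrete, here is a minimal sketch of a wrapper that cron could call. I wrote it in Python rather than shell, but the idea is the same. The default command assumes the one-step bin/nutch crawl command with -dir, -depth and -topN as used in Nutch 1.x; swap in whatever commands you already run. The paths, the file name recrawl.py and the 02:00 schedule are all made up for illustration.

```python
# Sketch: wrapper script for cron to call. Paths and defaults are assumptions.
# Hypothetical crontab entry that runs it every day at 02:00:
#   0 2 * * * /usr/bin/python3 /path/to/recrawl.py
import subprocess


def build_crawl_command(nutch_bin="bin/nutch", seed_dir="urls",
                        crawl_dir="crawl", depth=3, top_n=50):
    """Return the crawl command as an argument list (no shell quoting needed)."""
    return [nutch_bin, "crawl", seed_dir,
            "-dir", crawl_dir,
            "-depth", str(depth),
            "-topN", str(top_n)]


def run_crawl(command, runner=subprocess.run):
    """Run the command and return its exit code; runner is swappable for testing."""
    result = runner(command, check=False)
    return result.returncode


if __name__ == "__main__":
    # Exit with the crawler's exit code so cron can report failures.
    raise SystemExit(run_crawl(build_crawl_command()))
```

Running the script more often than your shortest fetch interval should be harmless, since Nutch only selects URLs whose fetch interval has elapsed.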
Upvotes: 2