mina

Reputation: 21

How can I recrawl different sites on different schedules in Nutch 1.3?

I have many sites. The content of some changes every month, and the content of others changes every day. Nutch 1.3 has already crawled them, and now I want to recrawl them on different schedules. How can I do that? Thanks.

Upvotes: 0

Views: 488

Answers (2)

tahagh

Reputation: 807

You can specify a fetch interval (the time between two consecutive crawls, in seconds) for each entry in your seed file. Separate the URL from the metadata with a tab character, shown here as \t:

http://daily.com \t nutch.fetchInterval=86400
http://montly.com \t nutch.fetchInterval=2592000

If you are using AdaptiveFetchSchedule, the entries above only set the starting interval. After each recrawl, the interval is increased or decreased depending on whether the page has changed. If you always want a fixed interval in this case, use nutch.fetchInterval.fixed instead of nutch.fetchInterval in the lines above.
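If you have many sites, writing the seed entries by hand gets tedious. Here is a minimal Python sketch that builds the tab-separated lines described above. The helper names (seed_line, SITES) are made up for illustration, and the 30-day month simply mirrors the 2592000 value used in the example entries.

```python
DAY = 24 * 60 * 60  # fetch intervals are given in seconds

# Hypothetical site list: URL -> desired interval in seconds.
SITES = {
    "http://daily.com": 1 * DAY,
    "http://montly.com": 30 * DAY,
}


def seed_line(url, interval_seconds, fixed=False):
    """Build one seed entry: URL, a real tab, then the interval metadata."""
    key = "nutch.fetchInterval.fixed" if fixed else "nutch.fetchInterval"
    return f"{url}\t{key}={interval_seconds}"


if __name__ == "__main__":
    # Print the entries; redirect the output into your seed file.
    for url, interval in SITES.items():
        print(seed_line(url, interval))
```

Pass fixed=True to emit nutch.fetchInterval.fixed when you use AdaptiveFetchSchedule but want the interval to stay constant.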

Upvotes: 1

Lina Clark

Reputation: 243

You can write a shell script containing the commands you use to run the crawler, and then use cron on Linux to schedule the execution of this script.

http://www.thegeekstuff.com/2011/07/cron-every-5-minutes/

Even Google recrawls the whole web repeatedly at some interval.
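As a rough sketch of the script-plus-cron idea, the same wrapper could be written in Python instead of shell. Everything specific here is an assumption for illustration: the NUTCH_HOME path, the seed and crawl directory names, the script location, and the crawl arguments. Substitute the exact command you already use to run your crawler.

```python
import subprocess
import sys

# Hypothetical install location; adjust to your setup.
NUTCH_HOME = "/opt/nutch"


def build_command(seed_dir, crawl_dir):
    """Return the command for one crawl run.

    Assumed command line; replace it with the one you already use.
    """
    return [f"{NUTCH_HOME}/bin/nutch", "crawl", seed_dir,
            "-dir", crawl_dir, "-depth", "1"]


def main(argv):
    seed_dir, crawl_dir = argv[1], argv[2]
    # Run the crawler and hand its exit status back to cron.
    return subprocess.call(build_command(seed_dir, crawl_dir))


# Example crontab entries (hypothetical paths), one per schedule:
#   0 2 * * *  python3 /opt/nutch/recrawl.py urls/daily crawl-daily
#   0 3 1 * *  python3 /opt/nutch/recrawl.py urls/monthly crawl-monthly
# The first runs every day at 02:00, the second on the 1st of each month at 03:00.

if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv))
```

Keeping the daily and monthly sites in separate seed directories lets each cron entry recrawl only its own group.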

Upvotes: 2
