Reputation: 283
I currently use a cron job to run a crawler every night; the crawler is only allowed to run at night. Sometimes there is so much data that one night is not enough to crawl everything, so I have to kill the process in the morning, around 6:00 am. How can I kill just the crawler process from a cron job?
Upvotes: 0
Views: 150
Reputation: 4864
It depends on what you use for crawling. With StormCrawler, which runs continuously, you can have one cron job start the crawl by calling the 'storm jar ...' command and another kill it with 'storm kill ...'. With Apache Nutch, you can achieve the same thing by listing the Hadoop jobs currently running and killing them. It would, however, be cleaner to let the current iteration finish, then parse and index the segment, before terminating the crawl.
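For the StormCrawler case, the two cron jobs could look roughly like the sketch below. Everything specific in it is an assumption for illustration, not taken from your setup: the 22:00 start time, the Storm install path /opt/storm/bin/storm, the jar path, the main class com.example.CrawlTopology and the topology name "crawler". It also assumes your main class takes the topology name as its argument. Absolute paths are used because cron usually runs with a minimal PATH.

```crontab
# Sketch only: paths, class name, topology name and start time are hypothetical.

# 22:00 every night: submit the crawl topology to the Storm cluster.
0 22 * * * /opt/storm/bin/storm jar /opt/crawler/crawler.jar com.example.CrawlTopology crawler >> /var/log/crawler-start.log 2>&1

# 06:00 every morning: kill the topology by name.
# -w 30 asks Storm to wait 30 seconds between deactivating and shutting down.
0 6 * * * /opt/storm/bin/storm kill crawler -w 30 >> /var/log/crawler-stop.log 2>&1
```

Killing by topology name means only the crawler is stopped; anything else running on the same cluster is left alone. If the topology is not running at 06:00, 'storm kill' should simply report an error, which ends up in the log file.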
Upvotes: 1