Reputation: 43
I'm running a self-contained Scrapy spider that lives in a single .py file. In case of a server failure, power outage, or any other reason the script might stop, is there an elegant way to make sure I can resume the run after recovery?
Maybe something similar to the built-in JOBDIR setting?
Upvotes: 3
Views: 873
Reputation: 22248
You can still use the JOBDIR option if you have a self-contained script, e.g. you can set a value in the spider's custom_settings attribute:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder spider name
    custom_settings = {
        'JOBDIR': './job',  # crawl state is persisted here and reloaded on the next run
    }
    # ... rest of the spider (start_urls, parse, etc.)
Alternatively, you can set this option when creating the CrawlerProcess (if that's how you're running spiders from a script):
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({'JOBDIR': './job'})
process.crawl(MySpider)
process.start()
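Putting the two pieces together, a minimal self-contained script could look roughly like this (the spider name, start URL, and parse logic below are placeholders; only the JOBDIR setting matters for resuming):

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'myspider'                      # placeholder name
    start_urls = ['https://example.com']   # placeholder start URL

    def parse(self, response):
        # placeholder parse logic
        yield {'url': response.url}


if __name__ == '__main__':
    # Pending requests and the dedupe filter are persisted under ./job,
    # so rerunning the script after a crash resumes the crawl.
    process = CrawlerProcess({'JOBDIR': './job'})
    process.crawl(MySpider)
    process.start()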
Upvotes: 2
Reputation: 21446
There's a whole documentation page covering this issue:
To start a spider with persistence support enabled, run it like this:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing the same command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Upvotes: 1
Reputation: 692
You can use supervisor to restart the script automatically if it exits unexpectedly; together with JOBDIR, the restarted run resumes instead of starting from scratch.
[program:foo]
command=/home/user/script_path/script.py   ; use an absolute path, supervisord does not expand ~
autorestart=true                           ; restart the script whenever it exits
Upvotes: 0