Reputation: 43
I'm running a self-contained Scrapy spider that lives in a single .py file. In case of a server failure, power outage, or any other reason the script might stop, is there an elegant way to make sure I can resume the run after recovery?
Maybe something similar to the built-in JOBDIR setting?
Upvotes: 3
Views: 873
Reputation: 22248
You can still use the JOBDIR option if you have a self-contained script, e.g. you can set a value in the spider's custom_settings attribute:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder spider name
    custom_settings = {
        'JOBDIR': './job',  # crawl state is persisted here and reloaded on the next run
    }
    # ... rest of the spider (start_urls, parse, etc.)
Alternatively, you can set this option when creating the CrawlerProcess (if that's how you're running spiders from a script):
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({'JOBDIR': './job'})
process.crawl(MySpider)
process.start()
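Putting the two pieces together, a minimal self-contained script could look roughly like this (the spider name, start URL, and parse logic below are placeholders; only the JOBDIR setting matters for resuming):

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'myspider'                      # placeholder name
    start_urls = ['https://example.com']   # placeholder start URL

    def parse(self, response):
        # placeholder parse logic
        yield {'url': response.url}


if __name__ == '__main__':
    # Pending requests and the dedupe filter are persisted under ./job,
    # so rerunning the script after a crash resumes the crawl.
    process = CrawlerProcess({'JOBDIR': './job'})
    process.crawl(MySpider)
    process.start()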
Upvotes: 2
Reputation: 21446
There's a whole documentation page covering this issue:
To start a spider with persistence support enabled, run it like this:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing the same command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Upvotes: 1
Reputation: 692
You can use supervisor to restart the script automatically if it exits unexpectedly; together with JOBDIR, the restarted run resumes instead of starting from scratch.
[program:foo]
command=/home/user/script_path/script.py   ; use an absolute path, supervisord does not expand ~
autorestart=true                           ; restart the script whenever it exits
Upvotes: 0