Dave Forgac

Reputation: 3326

How can I stop a Scrapy CrawlSpider and later resume where it left off?

I have a Scrapy CrawlSpider that has a very large list of URLs to crawl. I would like to be able to stop it, saving its current state, and resume it later without having to start over. Is there a way to accomplish this within the Scrapy framework?

Upvotes: 13

Views: 5865

Answers (3)

niko_gramophon

Reputation: 157

Just wanted to share that this feature is included in the latest Scrapy version, but the parameter name has changed. You should use it like this:

 scrapy crawl thespider --set JOBDIR=run1

For more information, see http://doc.scrapy.org/en/latest/topics/jobs.html#job-directory
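
Besides the scheduler queue and dupefilter that JOBDIR persists, the same docs page notes that a spider can keep its own data in the self.state dict, which is serialized between runs. A minimal sketch of what that looks like (the spider name, URL, and counter key here are just placeholders):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class TheSpider(CrawlSpider):
        name = "thespider"
        start_urls = ["http://example.com/"]
        rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

        def parse_item(self, response):
            # self.state is pickled into JOBDIR by a built-in extension,
            # so this counter survives a stop/resume cycle
            self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1
            yield {"url": response.url, "pages_seen": self.state["pages_seen"]}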

Upvotes: 10

Thang Tran

Reputation: 679

Scrapy now has this feature working and documented on their site, on the Jobs page linked in the answer above.

Here's the actual command:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1
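
To pause, press Ctrl-C once (pressing it a second time forces an unclean shutdown, and the state may not be saved correctly); to resume later, run the same command again with the same JOBDIR:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1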

Upvotes: 2

naeg

Reputation: 4002

There was a question on the mailing list just a few months ago: http://groups.google.com/group/scrapy-users/browse_thread/thread/6a8df07daff723fc?pli=1

Quoting Pablo:

We're not only considering it, but also working on it. There are currently two working patches in my MQ that add this functionality, in case anyone wants to try an early preview (they need to be applied in order):

http://hg.scrapy.org/users/pablo/mq/file/tip/scheduler_single_spider....
http://hg.scrapy.org/users/pablo/mq/file/tip/persistent_scheduler.patch

To run a spider as before (no persistence):

scrapy crawl thespider 

To run a spider storing scheduler+dupefilter state in a dir:

scrapy crawl thespider --set SCHEDULER_DIR=run1 

During the crawl, you can hit ^C to cancel the crawl and resume it later with:

scrapy crawl thespider --set SCHEDULER_DIR=run1 

The SCHEDULER_DIR setting name is bound to change before the final release, but the idea will be the same: you pass a directory in which to persist the state.
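
As the other answers note, the setting eventually shipped as JOBDIR, so the modern equivalent of the commands above would be:

scrapy crawl thespider -s JOBDIR=run1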

Upvotes: 6
