Reputation: 3326
I have a Scrapy CrawlSpider with a very large list of URLs to crawl. I would like to be able to stop it, save the current state, and resume it later without having to start over. Is there a way to accomplish this within the Scrapy framework?
Upvotes: 13
Views: 5865
Reputation: 157
Just wanted to share that this feature is included in the latest Scrapy version, but the parameter name has changed. You should use it like this:
scrapy crawl thespider --set JOBDIR=run1
For more information, see http://doc.scrapy.org/en/latest/topics/jobs.html#job-directory
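If you would rather configure it in code than on the command line, here is a minimal sketch assuming a hypothetical spider called thespider; as far as I know, JOBDIR can be supplied through the spider's custom_settings like any other setting:

import scrapy

class TheSpider(scrapy.Spider):
    # Hypothetical spider used only for illustration.
    name = "thespider"
    start_urls = ["http://example.com"]

    # Assumption: setting JOBDIR here behaves like --set JOBDIR=run1 on the CLI.
    # Scrapy persists the scheduler queue and dupefilter state into this directory,
    # so the crawl can be stopped and later resumed from the same state.
    custom_settings = {"JOBDIR": "run1"}

    def parse(self, response):
        yield {"url": response.url}

With this in place, scrapy crawl thespider starts (or resumes) the crawl, and a single Ctrl-C shuts it down gracefully so the state on disk stays consistent.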
Upvotes: 10
Reputation: 679
Scrapy now has a working feature for this, documented at http://doc.scrapy.org/en/latest/topics/jobs.html. Here's the actual command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
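Beyond the request queue, the same JOBDIR mechanism also keeps a spider-level state dict between runs, which is handy for counters or checkpoints. A small sketch (the pages_seen key is just an illustrative name):

import scrapy

class SomeSpider(scrapy.Spider):
    # Matches the "somespider" name used in the command above; purely illustrative.
    name = "somespider"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # self.state is serialized into JOBDIR on shutdown and restored on resume,
        # so this counter survives a pause/resume cycle.
        self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1
        yield {"url": response.url, "pages_seen": self.state["pages_seen"]}

Run it with scrapy crawl somespider -s JOBDIR=crawls/somespider-1, stop it with Ctrl-C, then re-run the same command to pick up where it left off.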
Upvotes: 2
Reputation: 4002
There was a question about this on the mailing list just a few months ago: http://groups.google.com/group/scrapy-users/browse_thread/thread/6a8df07daff723fc?pli=1
Quoting Pablo:
We're not only considering it, but also working on it. There are currently two working patches in my MQ that add this functionality, in case anyone wants to try an early preview (they need to be applied in order):
http://hg.scrapy.org/users/pablo/mq/file/tip/scheduler_single_spider....
http://hg.scrapy.org/users/pablo/mq/file/tip/persistent_scheduler.patch
To run a spider as before (no persistence):
scrapy crawl thespider
To run a spider storing scheduler+dupefilter state in a dir:
scrapy crawl thespider --set SCHEDULER_DIR=run1
During the crawl, you can hit ^C to cancel the crawl and resume it later with:
scrapy crawl thespider --set SCHEDULER_DIR=run1
The SCHEDULER_DIR setting name is bound to change before the final release, but the idea will be the same: you pass a directory where the state will be persisted.
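For anyone reading this later: the setting did end up released under the name JOBDIR. If you drive Scrapy from a script rather than the CLI, here is a rough sketch of passing the persistence directory programmatically; the spider import and directory name are placeholders:

from scrapy.crawler import CrawlerProcess
from myproject.spiders.thespider import TheSpider  # placeholder import

# Assumption: JOBDIR (the released name of the SCHEDULER_DIR preview setting)
# can be passed like any other setting when running Scrapy from a script.
process = CrawlerProcess(settings={"JOBDIR": "run1"})
process.crawl(TheSpider)
process.start()  # blocks until the crawl finishes or is interrupted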
Upvotes: 6