Reputation: 6571
I have started using Scrapy to scrape a few websites. If I later add a new field to my model or change my parsing functions, I'd like to be able to "replay" the downloaded raw data offline to scrape it again. It looks like Scrapy had the ability to store raw data in a replay file at one point:
http://dev.scrapy.org/browser/scrapy/trunk/scrapy/command/commands/replay.py?rev=168
But this functionality seems to have been removed in the current version of Scrapy. Is there another way to achieve this?
Upvotes: 14
Views: 5829
Reputation: 7082
You can enable HTTPCACHE_ENABLED, as described at http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html?highlight=FilesystemCacheStorage#httpcache-enabled, to cache all HTTP requests and responses and effectively resume crawling.
Alternatively, try Jobs to pause and resume crawling: http://scrapy.readthedocs.org/en/latest/topics/jobs.html
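A minimal sketch of both options, with setting and option names taken from the Scrapy docs (the spider name somespider and the JOBDIR path are just placeholders):

# settings.py -- switch on the built-in HTTP cache so every
# downloaded response is persisted to disk and replayed on later runs
HTTPCACHE_ENABLED = True

# Jobs: persist the scheduler and dupefilter state to a directory,
# so a crawl can be paused (e.g. with Ctrl-C) and resumed later
# by running the same command again:
#
#   scrapy crawl somespider -s JOBDIR=crawls/somespider-1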
Upvotes: 5
Reputation: 18385
If you run crawl --record=[cache.file] [scraper], you'll then be able to use replay [scraper].
Alternatively, you can cache all responses with the HttpCacheMiddleware by including it in DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    # enable Scrapy's built-in HTTP cache middleware
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 300,
}
If you do this, every time you run the scraper it will check the cache on the file system first and replay stored responses instead of downloading the pages again.
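Note that wiring in the middleware may not be enough on its own; per the docs linked in the other answer, the cache itself is controlled by the HTTPCACHE_* settings. A hedged sketch (setting names are from the Scrapy docs; the values shown are examples):

# settings.py -- control whether the cache is active, where it
# lives on disk, and how long entries stay valid; with these
# values a re-run never re-downloads a page it already fetched
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'       # stored under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0     # 0 means cached responses never expire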
Upvotes: 22