del

Reputation: 6571

Replay a Scrapy spider on stored data

I have started using Scrapy to scrape a few websites. If I later add a new field to my model or change my parsing functions, I'd like to be able to "replay" the downloaded raw data offline to scrape it again. It looks like Scrapy had the ability to store raw data in a replay file at one point:

http://dev.scrapy.org/browser/scrapy/trunk/scrapy/command/commands/replay.py?rev=168

But this functionality seems to have been removed in the current version of Scrapy. Is there another way to achieve this?

Upvotes: 14

Views: 5829

Answers (2)

fxp

Reputation: 7082

You can enable HTTPCACHE_ENABLED, as described at http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html?highlight=FilesystemCacheStorage#httpcache-enabled, to cache all HTTP requests and responses and then resume crawling from the cached data.
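A minimal sketch of what that might look like in settings.py (the directory and expiration values are only illustrative, not required):

# settings.py -- a minimal sketch of enabling the HTTP cache
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'        # stored under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0      # 0 means cached responses never expire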

Alternatively, try Jobs to pause and resume crawling: http://scrapy.readthedocs.org/en/latest/topics/jobs.html
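For the Jobs approach, the documented pattern is to pass a job directory on the command line (the spider name below is a placeholder):

scrapy crawl somespider -s JOBDIR=crawls/somespider-1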

Upvotes: 5

Tim McNamara

Reputation: 18385

If you run crawl --record=[cache.file] [scraper], you'll then be able to use replay [scraper].

Alternatively, you can cache all responses with the HttpCacheMiddleware by including it in DOWNLOADER_MIDDLEWARES:

# settings.py -- enable the HTTP cache middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 300,
}

If you do this, every time you run the scraper it will check the file-system cache first and only download responses that are not already stored there.
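Put together, a hedged sketch of the replay workflow (myspider is a placeholder name): with the cache enabled, crawl once to fill it, change your parsing code or item fields, then run the same crawl again; responses are read back from disk rather than re-downloaded.

# first run: downloads every page and writes it into the cache
scrapy crawl myspider

# ...edit your parse functions or add new item fields...

# second run: same command, but responses now come from the cached copies
scrapy crawl myspider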

Upvotes: 22
