Reputation: 45
I have started studying Scrapy to scrape a website. I built a simple spider to extract my items, and I store the raw data on AWS S3.
To meet my requirements, I enabled the Scrapy HTTP cache and pointed it at a custom S3 storage extension (the HTTPCACHE_STORAGE setting in my runner code below).
It worked fine, and I can see the cache folder on S3. Now I'd like to be able to "replay" the raw S3 data and scrape it again, in case I need to extract other items or change my parsing logic. Is there a way to achieve this?
My runner code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from my_scraper.spiders.my_scraper import MyScraper
from datetime import datetime
settings = get_project_settings()
#settings['LOG_LEVEL'] = 'INFO'
#* Data File output
date = datetime.now().strftime('%Y%m%d')
settings['FEED_URI'] = 's3://BUCKET/KEY/PREFIX-DATA/dumpdate=%s/test.json' % date
#settings['FEED_URI'] = './data/test_handler.json'
#settings['FEED_FORMAT'] = 'json'
#settings['LOG_FILE'] = 'Q1.log'
#* Enable cache.
settings['HTTPCACHE_EXPIRATION_SECS'] = 60 * 60 * 24 * 7  # cache lifetime: one week
settings['HTTPCACHE_DIR'] = 'httpcache'  # local cache dir
settings['HTTPCACHE_ENABLED'] = True
#* Extension
settings["HTTPCACHE_STORAGE"] = "my_scraper.extensions.s3cache.S3CacheStorage"
settings["S3CACHE_URI"] = 's3://BUCKET/KEY/PREFIX-CACHE/dumpdate=%s' %date
process = CrawlerProcess(settings=settings)
process.crawl(MyScraper)
process.start()
Upvotes: 1
Views: 596
Reputation: 33223
So long as your custom cache storage implements retrieve_response, I would expect it to Just Work™.
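For the replay run itself, re-running the crawl with the cache enabled and pointed at the prefix your first run wrote should serve responses from S3 instead of the live site. Here is a minimal sketch of the settings such a run might use; S3CACHE_URI and the storage path come from your runner, while HTTPCACHE_EXPIRATION_SECS and HTTPCACHE_IGNORE_MISSING are standard Scrapy cache settings (the dumpdate value must be the one from the original crawl, not a fresh date):

settings['HTTPCACHE_ENABLED'] = True
settings['HTTPCACHE_STORAGE'] = 'my_scraper.extensions.s3cache.S3CacheStorage'
settings['S3CACHE_URI'] = 's3://BUCKET/KEY/PREFIX-CACHE/dumpdate=YYYYMMDD'  # prefix written by the original run
settings['HTTPCACHE_EXPIRATION_SECS'] = 0    # 0 = cached entries never expire
settings['HTTPCACHE_IGNORE_MISSING'] = True  # don't download requests that are missing from the cache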
However, if you didn't store the requests in S3 indexed by their fingerprint hash the way FilesystemCacheStorage does, or by a similar scheme, I wouldn't expect you to be able to find the Request and Response objects to return from that method.
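For reference, here is a minimal sketch of what a fingerprint-indexed S3 storage could look like. It is illustrative only: the class name, key layout, and the use of boto3 plus pickled dicts are assumptions on my part, not your actual my_scraper.extensions.s3cache.S3CacheStorage (and note that request_fingerprint is superseded by request fingerprinters in newer Scrapy releases):

import pickle
from urllib.parse import urlparse

import boto3
from scrapy.http import Headers
from scrapy.responsetypes import responsetypes
from scrapy.utils.request import request_fingerprint


class S3CacheStorage:
    def __init__(self, settings):
        uri = urlparse(settings['S3CACHE_URI'])
        self.bucket = uri.netloc
        self.prefix = uri.path.lstrip('/')
        self.s3 = boto3.client('s3')

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        pass

    def _key(self, spider, request):
        # Index entries by request fingerprint, like FilesystemCacheStorage does
        fp = request_fingerprint(request)
        return f'{self.prefix}/{spider.name}/{fp[:2]}/{fp}'

    def retrieve_response(self, spider, request):
        # Return the cached Response for this request, or None on a cache miss
        try:
            obj = self.s3.get_object(Bucket=self.bucket, Key=self._key(spider, request))
        except self.s3.exceptions.NoSuchKey:
            return None  # the cache middleware treats None as "not cached"
        data = pickle.loads(obj['Body'].read())
        headers = Headers(data['headers'])
        respcls = responsetypes.from_args(headers=headers, url=data['url'], body=data['body'])
        return respcls(url=data['url'], headers=headers, status=data['status'], body=data['body'])

    def store_response(self, spider, request, response):
        # Persist just enough to rebuild the Response later
        data = {
            'url': response.url,
            'status': response.status,
            'headers': dict(response.headers),
            'body': response.body,
        }
        self.s3.put_object(Bucket=self.bucket, Key=self._key(spider, request), Body=pickle.dumps(data))

With a layout like that, the replay run above should find each request's entry by its fingerprint and hand the cached response straight to your (new or changed) parse callbacks without touching the site.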
Upvotes: 1