Reputation: 45
I have started studying Scrapy to scrape a website. I built a simple spider to extract my items, and I store the raw data on AWS S3.
To meet my requirements, I enabled the Scrapy HTTP cache and pointed it at a custom S3 storage extension (the HTTPCACHE_STORAGE setting in my runner code below).
It worked fine, and I can see the cache folder on S3. Now I'd like to be able to "replay" the raw S3 data and scrape it again, in case I need to extract other items or change my parsing logic. Is there a way to achieve this?
My runner code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from my_scraper.spiders.my_scraper import MyScraper
from datetime import datetime
settings = get_project_settings()
#settings['LOG_LEVEL'] = 'INFO'
#* Data File output
date = datetime.now().strftime('%Y%m%d')
settings['FEED_URI'] = 's3://BUCKET/KEY/PREFIX-DATA/dumpdate=%s/test.json' % date
#settings['FEED_URI'] = './data/test_handler.json'
#settings['FEED_FORMAT'] = 'json'
#settings['LOG_FILE'] = 'Q1.log'
#* Enable cache.
settings['HTTPCACHE_EXPIRATION_SECS'] = 60 * 60 * 24 * 7  # cache lifetime: one week
settings['HTTPCACHE_DIR'] = 'httpcache'  # local cache dir
settings['HTTPCACHE_ENABLED'] = True
#* Extension
settings["HTTPCACHE_STORAGE"] = "my_scraper.extensions.s3cache.S3CacheStorage"
settings["S3CACHE_URI"] = 's3://BUCKET/KEY/PREFIX-CACHE/dumpdate=%s' %date
process = CrawlerProcess(settings=settings)
process.crawl(MyScraper)
process.start()
Upvotes: 1
Views: 596
Reputation: 33223
So long as your custom cache storage implements retrieve_response, I would expect it to Just Work™.
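For the replay run itself, re-running the crawl with the cache enabled and pointed at the prefix your first run wrote should serve responses from S3 instead of the live site. Here is a minimal sketch of the settings such a run might use; S3CACHE_URI and the storage path come from your runner, while HTTPCACHE_EXPIRATION_SECS and HTTPCACHE_IGNORE_MISSING are standard Scrapy cache settings (the dumpdate value must be the one from the original crawl, not a fresh date):

settings['HTTPCACHE_ENABLED'] = True
settings['HTTPCACHE_STORAGE'] = 'my_scraper.extensions.s3cache.S3CacheStorage'
settings['S3CACHE_URI'] = 's3://BUCKET/KEY/PREFIX-CACHE/dumpdate=YYYYMMDD'  # prefix written by the original run
settings['HTTPCACHE_EXPIRATION_SECS'] = 0    # 0 = cached entries never expire
settings['HTTPCACHE_IGNORE_MISSING'] = True  # don't download requests that are missing from the cache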
However, if you didn't store the requests in S3 indexed by their fingerprint hash the way FilesystemCacheStorage does, or by a similar scheme, I wouldn't expect you to be able to find the Request and Response objects to return from that method.
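For reference, here is a minimal sketch of what a fingerprint-indexed S3 storage could look like. It is illustrative only: the class name, key layout, and the use of boto3 plus pickled dicts are assumptions on my part, not your actual my_scraper.extensions.s3cache.S3CacheStorage (and note that request_fingerprint is superseded by request fingerprinters in newer Scrapy releases):

import pickle
from urllib.parse import urlparse

import boto3
from scrapy.http import Headers
from scrapy.responsetypes import responsetypes
from scrapy.utils.request import request_fingerprint


class S3CacheStorage:
    def __init__(self, settings):
        uri = urlparse(settings['S3CACHE_URI'])
        self.bucket = uri.netloc
        self.prefix = uri.path.lstrip('/')
        self.s3 = boto3.client('s3')

    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        pass

    def _key(self, spider, request):
        # Index entries by request fingerprint, like FilesystemCacheStorage does
        fp = request_fingerprint(request)
        return f'{self.prefix}/{spider.name}/{fp[:2]}/{fp}'

    def retrieve_response(self, spider, request):
        # Return the cached Response for this request, or None on a cache miss
        try:
            obj = self.s3.get_object(Bucket=self.bucket, Key=self._key(spider, request))
        except self.s3.exceptions.NoSuchKey:
            return None  # the cache middleware treats None as "not cached"
        data = pickle.loads(obj['Body'].read())
        headers = Headers(data['headers'])
        respcls = responsetypes.from_args(headers=headers, url=data['url'], body=data['body'])
        return respcls(url=data['url'], headers=headers, status=data['status'], body=data['body'])

    def store_response(self, spider, request, response):
        # Persist just enough to rebuild the Response later
        data = {
            'url': response.url,
            'status': response.status,
            'headers': dict(response.headers),
            'body': response.body,
        }
        self.s3.put_object(Bucket=self.bucket, Key=self._key(spider, request), Body=pickle.dumps(data))

With a layout like that, the replay run above should find each request's entry by its fingerprint and hand the cached response straight to your (new or changed) parse callbacks without touching the site.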
Upvotes: 1