jwoww

Reputation: 1071

Pointing Scrapy at a local cache instead of performing a normal spidering process

I'm using pipelines to cache the documents from Scrapy crawls into a database, so that I can reparse them if I change the item parsing logic without having to hit the server again.

What's the best way to have Scrapy process from the cache instead of trying to perform a normal crawl?

I like Scrapy's support for CSS and XPath selectors; without it, I would just query the database separately and parse the documents with lxml.

For a time I wasn't caching the documents at all; I used Scrapy in the normal fashion, parsing the items on the fly, but I've found that changing the item logic requires a time- and resource-intensive recrawl. Instead, I now cache the document body along with the parsed item, and I want the option to have Scrapy iterate through those documents from the database instead of crawling the target URLs again.
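For context, the kind of caching pipeline described above might look roughly like this, a minimal sketch assuming the spider puts hypothetical `url` and `body` fields on each item (SQLite stands in for whatever database is actually used):

```python
import sqlite3


class CachePipeline(object):
    """Stores the raw page body so items can be re-parsed later.

    Assumes each item carries 'url' and 'body' fields filled in by the
    spider; both field names are assumptions for this sketch.
    """

    def open_spider(self, spider):
        self.connection = sqlite3.connect("cache.db")
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
        )

    def close_spider(self, spider):
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # upsert so a re-crawl refreshes the cached copy
        self.connection.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
            (item["url"], item["body"]),
        )
        return item
```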

How do I go about modifying Scrapy so I have the option to pass it a set of documents and have it parse them individually, as if it had just pulled them down from the web?

Upvotes: 4

Views: 782

Answers (1)

alecxe

Reputation: 473903

I think a custom Downloader Middleware is a good way to go. The idea is to have this middleware return the page source directly from the database, so that Scrapy never makes an HTTP request.

Sample implementation (not tested and definitely needs error-handling):

import re

import MySQLdb
from scrapy.http import HtmlResponse
from scrapy.exceptions import IgnoreRequest
from scrapy.conf import settings


class CustomDownloaderMiddleware(object):
    def __init__(self, *args, **kwargs):
        super(CustomDownloaderMiddleware, self).__init__(*args, **kwargs)

        self.connection = MySQLdb.connect(**settings.DATABASE)
        self.cursor = self.connection.cursor()

    def process_request(self, request, spider):
        # extracting the product id from the url
        product_id = re.search(r"(\d+)$", request.url).group(1)

        # getting cached source code from the database by product id
        self.cursor.execute("""
            SELECT
                source_code
            FROM
                products
            WHERE
                product_id = %s
        """, (product_id,))

        row = self.cursor.fetchone()
        if row is None:
            # no cached copy - drop the request instead of crashing
            raise IgnoreRequest("no cached copy of %s" % request.url)

        # making an HTTP response instance without actually hitting the
        # web-site; HtmlResponse (not the base Response) keeps the
        # spider's css()/xpath() selectors working
        return HtmlResponse(url=request.url, body=row[0], encoding="utf-8")
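The middleware assumes every request URL ends in a numeric product id; that extraction, isolated as a hypothetical helper:

```python
import re


def product_id_from_url(url):
    # assumes the url ends with a numeric product id,
    # e.g. "http://example.com/product/12345"
    match = re.search(r"(\d+)$", url)
    return match.group(1) if match else None


print(product_id_from_url("http://example.com/product/12345"))  # prints 12345
```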

And don't forget to activate the middleware.
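Activation is a `settings.py` entry along these lines; the module path, priority, and connection parameters are assumptions to adapt to your project (the middleware above reads `settings.DATABASE` and passes it straight to `MySQLdb.connect`):

```python
# settings.py (module path is an assumption; adjust to your project layout)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}

# connection parameters consumed via MySQLdb.connect(**settings.DATABASE)
DATABASE = {
    'host': 'localhost',
    'user': 'scrapy',
    'passwd': 'secret',
    'db': 'cache',
}
```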

Upvotes: 1
