Reputation: 1071
I'm using pipelines to cache the documents from Scrapy crawls into a database, so that I can reparse them if I change the item parsing logic without having to hit the server again.
What's the best way to have Scrapy process from the cache instead of trying to perform a normal crawl?
I like scrapy's support for CSS and XPath selectors, else I would just hit the database separately with a lxml parser.
For a time, I wasn't caching the document at all and using Scrapy in a normal fashion - parsing the items on the fly - but I've found that changing the item logic requires a time and resource intensive recrawl. Instead, I'm now caching the document body along with the item parse, and I want to have the option to have Scrapy iterate through those documents from a database instead of crawling the target URL.
How do I go about modifying Scrapy to give me the option to pass it a set of documents and then parsing them individually as if it had just pulled them down from the web?
Upvotes: 4
Views: 782
Reputation: 473903
I think a custom Downloader Middleware is a good way to go. The idea is to have this middleware return a source code directly from the database and don't let Scrapy make any HTTP requests.
Sample implementation (not tested and definitely needs error-handling):
import re
import MySQLdb
from scrapy.http import Response
from scrapy.exceptions import IgnoreRequest
from scrapy import log
from scrapy.conf import settings
class CustomDownloaderMiddleware(object):
def __init__(self, *args, **kwargs):
super(CustomDownloaderMiddleware, self).__init__(*args, **kwargs)
self.connection = MySQLdb.connect(**settings.DATABASE)
self.cursor = self.connection.cursor()
def process_request(self, request, spider):
# extracting product id from a url
product_id = re.search(request.url, r"(\d+)$").group(1)
# getting cached source code from the database by product id
self.cursor.execute("""
SELECT
source_code
FROM
products
WHERE
product_id = %s
""", product_id)
source_code = self.cursor.fetchone()[0]
# making HTTP response instance without actually hitting the web-site
return Response(url=request.url, body=source_code)
And don't forget to activate the middleware.
Upvotes: 1