jwoww

Reputation: 1071

Pointing Scrapy at a local cache instead of performing a normal spidering process

I'm using pipelines to cache the documents from Scrapy crawls into a database, so that I can reparse them if I change the item parsing logic without having to hit the server again.

What's the best way to have Scrapy process from the cache instead of trying to perform a normal crawl?

I like Scrapy's support for CSS and XPath selectors; without it, I would just query the database separately and parse the documents with lxml.

For a time I wasn't caching the documents at all; I used Scrapy in the normal fashion, parsing the items on the fly, but I've found that changing the item logic requires a time- and resource-intensive recrawl. Instead, I now cache the document body along with the parsed item, and I want the option to have Scrapy iterate through those documents from the database instead of crawling the target URLs again.
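For context, the kind of caching pipeline described above might look roughly like this, a minimal sketch assuming the spider puts hypothetical `url` and `body` fields on each item (SQLite stands in for whatever database is actually used):

```python
import sqlite3


class CachePipeline(object):
    """Stores the raw page body so items can be re-parsed later.

    Assumes each item carries 'url' and 'body' fields filled in by the
    spider; both field names are assumptions for this sketch.
    """

    def open_spider(self, spider):
        self.connection = sqlite3.connect("cache.db")
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
        )

    def close_spider(self, spider):
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # upsert so a re-crawl refreshes the cached copy
        self.connection.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
            (item["url"], item["body"]),
        )
        return item
```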

How do I go about modifying Scrapy so I have the option to pass it a set of documents and have it parse them individually, as if it had just pulled them down from the web?

Upvotes: 4

Views: 782

Answers (1)

alecxe

Reputation: 473903

I think a custom Downloader Middleware is a good way to go. The idea is to have this middleware return the page source directly from the database, so that Scrapy never makes an HTTP request.

Sample implementation (not tested and definitely needs error-handling):

import re

import MySQLdb
from scrapy.http import HtmlResponse
from scrapy.exceptions import IgnoreRequest
from scrapy.conf import settings


class CustomDownloaderMiddleware(object):
    def __init__(self, *args, **kwargs):
        super(CustomDownloaderMiddleware, self).__init__(*args, **kwargs)

        self.connection = MySQLdb.connect(**settings.DATABASE)
        self.cursor = self.connection.cursor()

    def process_request(self, request, spider):
        # extracting the product id from the url
        product_id = re.search(r"(\d+)$", request.url).group(1)

        # getting cached source code from the database by product id
        self.cursor.execute("""
            SELECT
                source_code
            FROM
                products
            WHERE
                product_id = %s
        """, (product_id,))

        row = self.cursor.fetchone()
        if row is None:
            # no cached copy - drop the request instead of crashing
            raise IgnoreRequest("no cached copy of %s" % request.url)

        # making an HTTP response instance without actually hitting the
        # web-site; HtmlResponse (not the base Response) keeps the
        # spider's css()/xpath() selectors working
        return HtmlResponse(url=request.url, body=row[0], encoding="utf-8")
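The middleware assumes every request URL ends in a numeric product id; that extraction, isolated as a hypothetical helper:

```python
import re


def product_id_from_url(url):
    # assumes the url ends with a numeric product id,
    # e.g. "http://example.com/product/12345"
    match = re.search(r"(\d+)$", url)
    return match.group(1) if match else None


print(product_id_from_url("http://example.com/product/12345"))  # prints 12345
```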

And don't forget to activate the middleware.
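Activation is a `settings.py` entry along these lines; the module path, priority, and connection parameters are assumptions to adapt to your project (the middleware above reads `settings.DATABASE` and passes it straight to `MySQLdb.connect`):

```python
# settings.py (module path is an assumption; adjust to your project layout)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}

# connection parameters consumed via MySQLdb.connect(**settings.DATABASE)
DATABASE = {
    'host': 'localhost',
    'user': 'scrapy',
    'passwd': 'secret',
    'db': 'cache',
}
```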

Upvotes: 1
