Scrapy CrawlSpider rules with multiple callbacks

Question

I'm trying to create an ExampleSpider that extends Scrapy's CrawlSpider. It should be able to process pages containing only artist info, pages containing only album info, and pages that contain both album and artist info.

I was able to handle the first two scenarios, but the problem occurs in the third. I'm using a parse_artist(response) method to process artist data and a parse_album(response) method to process album data. My question is: if a page contains both artist and album data, how should I define my rules?

Should I do it like below? (Two rules for the same URL pattern)
Should I use multiple callbacks? (Does Scrapy support multiple callbacks?)

Is there another, proper way to do it?

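# Import paths from the Scrapy 0.x releases that shipped SgmlLinkExtractor.
# ArtistItem and AlbumItem are the project's own Item classes (definitions not shown).
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):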
    name = 'example'

    start_urls = ['http://www.example.com']

    rules = [
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
        # more rules .....
    ]

    def parse_artist(self, response):
        artist_item = ArtistItem()
        try:
            # do the scrape and assign to ArtistItem
            pass
        except Exception:
            # ignore for now
            pass
        return artist_item

    def parse_album(self, response):
        album_item = AlbumItem()
        try:
            # do the scrape and assign to AlbumItem
            pass
        except Exception:
            # ignore for now
            pass
        return album_item

kev · Accepted Answer

CrawlSpider calls its _requests_to_follow() method to extract URLs and generate the requests to follow:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        seen = seen.union(links)
        for link in links:
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)

As you can see:

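The variable seen records the links that have already been extracted, so a link picked up by an earlier rule is skipped by every later rule.
As a result, each URL is handled by at most one callback: the one belonging to the first rule that matches it.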
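
To make the "first matching rule wins" behaviour concrete, here is a minimal sketch that mimics the seen-set logic outside of Scrapy. The RULES list, the /music/ pattern, and the assign_callbacks() helper are hypothetical stand-ins for illustration, not part of Scrapy itself:

```python
import re

# Hypothetical stand-ins for two rules that share the same URL pattern.
RULES = [
    (re.compile(r"/music/"), "parse_artist"),
    (re.compile(r"/music/"), "parse_album"),
]


def assign_callbacks(urls, rules=RULES):
    """Mimic the seen-set logic: each URL goes to the first rule that extracts it."""
    seen = set()
    assigned = {}
    for pattern, callback in rules:
        # A later rule only gets links that no earlier rule has claimed.
        links = [u for u in urls if pattern.search(u) and u not in seen]
        seen = seen.union(links)
        for url in links:
            assigned[url] = callback
    return assigned


result = assign_callbacks([
    "http://www.example.com/music/1",
    "http://www.example.com/about",
])
print(result)
```

In this sketch the second rule never receives the /music/ URL, which is why the parse_album rule in the question would never fire.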

Instead, you can define a single parse_item() callback that calls both parse_artist() and parse_album():

rules = [
    Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]

def parse_item(self, response):
    # One callback per URL: yield the result of each parser in turn.
    yield self.parse_artist(response)
    yield self.parse_album(response)
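
The version above assumes each parser returns exactly one item. If a parser might return nothing, or yield several items, a small normalizing helper keeps parse_item() tidy. The following is a sketch under those assumptions: iter_callback_output() and CombinedParser are hypothetical names, and plain dicts stand in for responses and items so the example runs without Scrapy:

```python
from collections.abc import Mapping


def iter_callback_output(output):
    """Turn a parser's return value (None, one item, or an iterable) into a stream."""
    if output is None:
        return
    # Mappings (plain dicts, and item objects that behave like mappings)
    # are treated as a single result rather than iterated key by key.
    if isinstance(output, (Mapping, str, bytes)) or not hasattr(output, "__iter__"):
        yield output
        return
    for element in output:
        if element is not None:
            yield element


class CombinedParser:
    """Hypothetical stand-in for the spider; a dict plays the role of the response."""

    def parse_artist(self, response):
        # Returns one item, or None when the page has no artist data.
        if "artist" in response:
            return {"type": "artist", "name": response["artist"]}
        return None

    def parse_album(self, response):
        # Yields zero or more items.
        for title in response.get("albums", []):
            yield {"type": "album", "title": title}

    def parse_item(self, response):
        # Single entry point: run every parser and merge their results.
        for callback in (self.parse_artist, self.parse_album):
            yield from iter_callback_output(callback(response))


page = {"artist": "Some Artist", "albums": ["First", "Second"]}
items = list(CombinedParser().parse_item(page))
album_only = list(CombinedParser().parse_item({"albums": ["Solo"]}))
print(items)
```

Returning None from a parser when the page lacks that kind of data avoids emitting empty items for artist-only or album-only pages.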
