Grainier

Reputation: 1654

Scrapy CrawlSpider rules with multiple callbacks

I'm trying to create an ExampleSpider which extends Scrapy's CrawlSpider. My ExampleSpider should be able to process pages containing only artist info, pages containing only album info, and some other pages which contain both artist and album info.

I was able to handle the first two scenarios, but the problem occurs in the third. I'm using a parse_artist(response) method to process artist data and a parse_album(response) method to process album data. My question is: if a page contains both artist and album data, how should I define my rules?


  1. Should I do it as below? (Two rules for the same URL pattern)
  2. Should I use multiple callbacks? (Does Scrapy support multiple callbacks?)
  3. Is there another way to do it? (A proper way)

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    # ArtistItem and AlbumItem are defined in this project's items module

    class ExampleSpider(CrawlSpider):
        name = 'example'

        start_urls = ['http://www.example.com']

        rules = [
            # two rules with the same URL pattern, but different callbacks
            Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
            Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
            # more rules .....
        ]

        def parse_artist(self, response):
            artist_item = ArtistItem()
            try:
                # do the scrape and assign to ArtistItem
                pass
            except Exception:
                # ignore for now
                pass
            return artist_item

        def parse_album(self, response):
            album_item = AlbumItem()
            try:
                # do the scrape and assign to AlbumItem
                pass
            except Exception:
                # ignore for now
                pass
            return album_item
    

Upvotes: 3

Views: 4422

Answers (1)

kev

Reputation: 161734

CrawlSpider calls the _requests_to_follow() method to extract URLs and generate the requests to follow:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        # skip links already claimed by an earlier rule
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        seen = seen.union(links)
        for link in links:
            # one request per link, bound to this rule's callback only
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)

As you can see:

  • The seen set keeps track of links that have already been processed.
  • Every URL is handled by at most one callback: the one attached to the first rule whose extractor matches it.
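
To see this in isolation, here is a small standalone simulation of that seen-set logic (plain Python, no Scrapy needed; the link and callback names are made up for illustration):

# Two "rules" that both extract the same link, mimicking two Rule()
# entries with the same regex pattern.
rules = [
    ('parse_artist', ['http://www.example.com/page1']),
    ('parse_album',  ['http://www.example.com/page1']),
]

seen = set()
for callback, extracted in rules:
    # drop links already claimed by an earlier rule
    links = [l for l in extracted if l not in seen]
    seen = seen.union(links)
    for link in links:
        print('request %s -> %s' % (link, callback))

# Prints a single line:
#   request http://www.example.com/page1 -> parse_artist
# parse_album never fires, because the first rule claimed the link.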

You can define a parse_item() to call parse_artist() and parse_album():

rules = [
    Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]

def parse_item(self, response):
    yield self.parse_artist(response)
    yield self.parse_album(response)
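
Note that Scrapy silently ignores None values yielded from a callback, so if you change parse_artist() and parse_album() to return None on pages that lack their kind of data, this delegation pattern still works. A slightly more defensive variant of the same idea (my sketch, not part of the original answer):

def parse_item(self, response):
    # delegate to both parsers; skip whichever one found nothing
    for item in (self.parse_artist(response), self.parse_album(response)):
        if item is not None:
            yield item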

Upvotes: 9
