user2066880

Reputation: 5034

How to set a rule according to the current URL?

I'm using Scrapy and I want more control over the crawler. To do this I would like to set rules depending on the URL currently being processed.

For example, if I am on example.com/a I want to apply a rule with LinkExtractor(restrict_xpaths='//div[@class="1"]'), and if I'm on example.com/b I want to use another Rule with a different LinkExtractor.

How do I accomplish this?

Upvotes: 2

Views: 57

Answers (1)

Elias Dorneles

Reputation: 23806

I'd just code this in the callbacks, instead of relying on the CrawlSpider rules.

import scrapy
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # Fall back to a default extractor (no arguments means: extract all links).
    extractor = LinkExtractor()

    # Use a more specific extractor for this section of the site.
    if 'example.com/a' in response.url:
        extractor = LinkExtractor(restrict_xpaths='//div[@class="1"]')

    for link in extractor.extract_links(response):
        # self.whatever is a placeholder for your actual parsing callback.
        yield scrapy.Request(link.url, callback=self.whatever)

This is better than trying to change the rules at runtime, because CrawlSpider rules are meant to be the same for every response the spider processes.

In this case I've just used link extractors, but if you want to use full Rule objects you can do much the same thing, mirroring the loop that CrawlSpider._requests_to_follow uses to handle the rules.
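
For example, a minimal sketch of that idea in a plain Spider might look like the one below. The URL fragments, XPaths, and callback names (parse_a, parse_b) are hypothetical, and the extract/dedupe/follow loop is modeled on what CrawlSpider._requests_to_follow does internally:

import scrapy
from scrapy.linkextractors import LinkExtractor

class PerUrlSpider(scrapy.Spider):
    name = 'per_url'
    start_urls = ['http://example.com/a', 'http://example.com/b']

    # Each entry plays the role of a CrawlSpider Rule:
    # (URL fragment to match, link extractor, name of the follow-up callback).
    rules_by_url = [
        ('example.com/a', LinkExtractor(restrict_xpaths='//div[@class="1"]'), 'parse_a'),
        ('example.com/b', LinkExtractor(restrict_xpaths='//div[@class="2"]'), 'parse_b'),
    ]

    def parse(self, response):
        seen = set()
        # Mirror the extract/dedupe/follow loop from CrawlSpider._requests_to_follow,
        # but only for the entries whose pattern matches the current URL.
        for pattern, extractor, callback_name in self.rules_by_url:
            if pattern not in response.url:
                continue
            for link in extractor.extract_links(response):
                if link.url in seen:
                    continue
                seen.add(link.url)
                yield scrapy.Request(link.url, callback=getattr(self, callback_name))

    def parse_a(self, response):
        pass  # extract items from pages reached via the '/a' entry

    def parse_b(self, response):
        pass  # extract items from pages reached via the '/b' entry

Keeping the per-URL mapping in a class attribute keeps the parse method generic, so adding a new section of the site only means adding another entry to the list.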

Upvotes: 2
