user2066880

Reputation: 5034

How to set a rule according to the current URL?

I'm using Scrapy and I want more control over the crawler. To do this I would like to set rules depending on the URL currently being processed.

For example, if I am on example.com/a I want to apply a rule with LinkExtractor(restrict_xpaths='//div[@class="1"]'), and if I'm on example.com/b I want to use another Rule with a different LinkExtractor.

How do I accomplish this?

Upvotes: 2

Views: 57

Answers (1)

Elias Dorneles

Reputation: 23806

I'd just code this in the callbacks, instead of relying on the CrawlSpider rules.

import scrapy
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # Fall back to a default extractor (no arguments means: extract all links).
    extractor = LinkExtractor()

    # Use a more specific extractor for this section of the site.
    if 'example.com/a' in response.url:
        extractor = LinkExtractor(restrict_xpaths='//div[@class="1"]')

    for link in extractor.extract_links(response):
        # self.whatever is a placeholder for your actual parsing callback.
        yield scrapy.Request(link.url, callback=self.whatever)

This is better than trying to change the rules at runtime, because CrawlSpider rules are meant to be the same for every response the spider processes.

In this case I've just used link extractors, but if you want to use full Rule objects you can do much the same thing, mirroring the loop that CrawlSpider._requests_to_follow uses to handle the rules.
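
For example, a minimal sketch of that idea in a plain Spider might look like the one below. The URL fragments, XPaths, and callback names (parse_a, parse_b) are hypothetical, and the extract/dedupe/follow loop is modeled on what CrawlSpider._requests_to_follow does internally:

import scrapy
from scrapy.linkextractors import LinkExtractor

class PerUrlSpider(scrapy.Spider):
    name = 'per_url'
    start_urls = ['http://example.com/a', 'http://example.com/b']

    # Each entry plays the role of a CrawlSpider Rule:
    # (URL fragment to match, link extractor, name of the follow-up callback).
    rules_by_url = [
        ('example.com/a', LinkExtractor(restrict_xpaths='//div[@class="1"]'), 'parse_a'),
        ('example.com/b', LinkExtractor(restrict_xpaths='//div[@class="2"]'), 'parse_b'),
    ]

    def parse(self, response):
        seen = set()
        # Mirror the extract/dedupe/follow loop from CrawlSpider._requests_to_follow,
        # but only for the entries whose pattern matches the current URL.
        for pattern, extractor, callback_name in self.rules_by_url:
            if pattern not in response.url:
                continue
            for link in extractor.extract_links(response):
                if link.url in seen:
                    continue
                seen.add(link.url)
                yield scrapy.Request(link.url, callback=getattr(self, callback_name))

    def parse_a(self, response):
        pass  # extract items from pages reached via the '/a' entry

    def parse_b(self, response):
        pass  # extract items from pages reached via the '/b' entry

Keeping the per-URL mapping in a class attribute keeps the parse method generic, so adding a new section of the site only means adding another entry to the list.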

Upvotes: 2
