user5228393

In which order do the rules get evaluated in the CrawlSpider?

I have a question regarding the order in which the rules get evaluated in a CrawlSpider. If I have the code below:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['http://someurlhere.com']
    rules = (
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']"]),
            callback='first_callback'
        ),
        Rule(
            # allow takes a regex, so '.' and '?' must be escaped
            LinkExtractor(allow=r'/product\.php\?id=\d+'),
            callback='second_callback'
        )
    )

My question is this: are the links extracted by the first LinkExtractor rule simply handed to the scheduler rather than followed immediately? And after all the links extracted by the first LinkExtractor have been scheduled, will first_callback be called for each of those links, with the corresponding response passed to it?

Also, when is the second LinkExtractor called? Does the first LinkExtractor get fully evaluated before the second one runs?

Upvotes: 2

Views: 496

Answers (1)

Rahul

Reputation: 2100

If we go through the official documentation, the process is simple.

First, the responses for your start_urls are parsed, and then links are extracted from every subsequently crawled page according to the rules you provide.

Now, coming to your first question:

Are the links extracted by the first LinkExtractor rule simply handed to the scheduler rather than followed immediately? And after all the links extracted by the first LinkExtractor have been scheduled, will first_callback be called for each of those links, with the corresponding response passed to it?

If callback is None, follow defaults to True; otherwise it defaults to False. In your case that means there is no following: both rules set a callback, so only the links extracted from the start URL's response end up in the scheduler, and the crawl ends once all of those responses have been parsed.
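
One option is to keep your callbacks and set follow=True explicitly. A minimal sketch of your first rule with that single change (everything else as in your question):

Rule(
    LinkExtractor(restrict_xpaths=[
        "//ul[@class='menu-categories']",
        "//ul[@class='menu-subcategories']"]),
    callback='first_callback',
    follow=True  # overrides the default of False when a callback is set
),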

Alternatively, split your rules: work out which pages hold your content and which are just navigation. A rule without a callback follows links by default, while a rule with a callback parses your items:

# Extract links matching 'products' (but not matching 'shampoo')
# and follow links from them (since no callback means follow=True by default).
Rule(LinkExtractor(allow=('products', ), deny=('shampoo', ))),

# Extract links matching 'item' and parse them with the spider's method parse_item
Rule(LinkExtractor(allow=('item', )), callback='parse_item'),
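
Applied to your spider, the same pattern could look like the sketch below. The start URL, XPaths, and product pattern come from your question; the spider name and the parse_product callback are assumptions for illustration:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'  # assumed name for illustration
    start_urls = ['http://someurlhere.com']
    rules = (
        # Navigation rule: no callback, so follow defaults to True and the
        # category/subcategory pages are crawled further for more links.
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']"])
        ),
        # Content rule: product pages are handed to the callback.
        Rule(
            LinkExtractor(allow=r'/product\.php\?id=\d+'),
            callback='parse_product'
        ),
    )

    def parse_product(self, response):
        # Hypothetical extraction; adapt the selectors to the actual page.
        yield {'url': response.url, 'title': response.css('title::text').get()}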

Now, coming to your second question:

Also, when is the second LinkExtractor called? Does the first LinkExtractor get fully evaluated before the second one runs?

One is not dependent on the other: every response is passed through each rule's LinkExtractor, which applies its regex or string matching independently. The rules are checked in the order they are defined, though, and if more than one rule matches the same link, only the first matching rule is applied, as the official documentation notes. Whichever rule matches then proceeds with its callback or follow-up.
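
For example, with the two hypothetical rules below, a link such as /product.php?id=7 matches both extractors, but only the first rule is applied to it, so parse_listing never receives that link:

rules = (
    # Checked first: product-detail links go to parse_product.
    Rule(LinkExtractor(allow=r'/product\.php\?id=\d+'), callback='parse_product'),
    # Checked second: any other link containing 'product'. Links already
    # matched by the rule above are not matched again here.
    Rule(LinkExtractor(allow=r'product'), callback='parse_listing'),
)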

Upvotes: 0
