I have a question regarding the order in which the rules get evaluated in a CrawlSpider. If I have the code below:
from scrapy.contrib.spiders.crawl import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'  # a spider needs a name to run
    start_urls = ['http://someurlhere.com']

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']"]),
            callback='first_callback'
        ),
        Rule(
            # `allow` takes a regex, so '.' and '?' are escaped here
            LinkExtractor(allow=r'/product\.php\?id=\d+'),
            callback='second_callback'
        )
    )
In this case, Scrapy will request 'http://someurlhere.com' from the start_urls list and call the default parse callback when it gets the response. Now, my question is: are the links extracted by the FIRST LinkExtractor rule simply scheduled in the scheduler rather than followed immediately? That is, after it schedules all the links extracted by the first LinkExtractor, will it then call the first_callback method for each of those links, with the response passed to first_callback?
Also, when is the second LinkExtractor going to be called? Does the first LinkExtractor get evaluated first, and only then does the second LinkExtractor run?
Upvotes: 2
Views: 496
Reputation: 2100
If we go through the official documentation, the process is simple: first your start URL is parsed, and then the links on every subsequently crawled page are extracted according to the rules you provide.
Now coming to your question.
Now my question is the links that are extracted from the FIRST LinkExtractor rule, are they simply scheduled in the scheduler and not followed immediately? So after it schedules all the links which are extracted from the first LinkExtractor then it will call the first_callback method for all of those links with the response passed to that first_callback?
If callback is None, follow defaults to True; otherwise it defaults to False. In your case that means there is no follow-up: whatever links are extracted from the start URL's response are all that ends up in the scheduler, and the crawl ends once all of those have been parsed.
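That defaulting behaviour can be illustrated with a tiny stand-alone sketch (this mirrors the rule described above, not Scrapy's actual source):

```python
def resolve_follow(callback=None, follow=None):
    """If `follow` is not given explicitly, it defaults to True when there
    is no callback, and to False when a callback is set."""
    if follow is None:
        return callback is None
    return follow

print(resolve_follow())                                        # no callback -> follows links
print(resolve_follow(callback='first_callback'))               # callback set -> does not follow
print(resolve_follow(callback='first_callback', follow=True))  # explicit follow wins
```

Passing follow=True explicitly is therefore the way to have a callback and still keep crawling deeper.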
If you want the spider to follow links, structure your rules accordingly: work out where your content lives and where the links leading to it are.
# Extract links matching 'products' (but not matching 'shampoo')
# and follow links from them (since no callback means follow=True by default).
Rule(LinkExtractor(allow=('products', ), deny=('shampoo', ))),
# Extract links matching 'item' and parse them with the spider's method parse_item
Rule(LinkExtractor(allow=('item', )), callback='parse_item'),
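Applied to the rules in the question, one way to get both behaviours is to keep the callback on the first rule but opt in to following explicitly. This is only a sketch: the URL, XPaths, and callback names are the asker's placeholders, and the imports match the question's Scrapy version.

```python
from scrapy.contrib.spiders.crawl import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['http://someurlhere.com']

    rules = (
        # Parse category/subcategory pages with first_callback AND keep
        # following the links found on them (follow=True opts back in,
        # since a callback alone would default follow to False).
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']"]),
            callback='first_callback',
            follow=True,
        ),
        # Product pages are leaf pages: parse them, no need to follow.
        Rule(
            LinkExtractor(allow=r'/product\.php\?id=\d+'),
            callback='second_callback',
        ),
    )
```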
Now coming to your second question:
Also when is the second LinkExtractor going to be called? Does the first LinkExtractor get evaluated and then only the second LinkExtractor runs?
One is not dependent on the other: each LinkExtractor applies its regex or string matching independently against every response, and when a rule's extractor matches a URL, that rule's callback (or follow-up) is used. Note, though, that the rules are tried in the order they are defined, and if more than one rule would match the same link, only the first matching rule is applied.
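To make the ordering concrete, here is a toy model of the matching (not Scrapy's real implementation, and the URL patterns are made up): rules are tried top to bottom, and a link is claimed by the first rule whose pattern matches it.

```python
import re

# (callback name, pattern) pairs, in the order the rules are defined
rules = [
    ('first_callback', re.compile(r'/category/')),
    ('second_callback', re.compile(r'/product\.php\?id=\d+')),
]

def match_rule(url):
    """Return the callback of the first rule whose pattern matches `url`."""
    for callback, pattern in rules:
        if pattern.search(url):
            return callback
    return None  # no rule matched; the link is ignored

print(match_rule('http://someurlhere.com/category/shoes'))     # first_callback
print(match_rule('http://someurlhere.com/product.php?id=42'))  # second_callback
print(match_rule('http://someurlhere.com/about'))              # None
```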
Upvotes: 0