Reputation: 332
I'd appreciate it if someone could help me understand how rules stack for depth crawling. Does stacking multiple rules mean the rules are processed one at a time? The aim is to grab links from the main page, return the items and the responses, and pass them to the next rule, which will pass the links to another function, and so on.
rules = (
    Rule(LinkExtractor(restrict_xpaths='--some xpath--'), callback='function_a', follow=True),
    Rule(LinkExtractor(restrict_xpaths='--some xpath--'), callback='function_b', process_links='function_c', follow=True),
)

def function_a(self, response):
    # grab sports, games, link3 from the main page
    i = response.xpath('---some xpath---')
    for xpth in i:
        item = ItemA()
        item['name'] = xpth.xpath('---some xpath--').extract_first()
        yield item
        yield scrapy.Request(url)  # yield each item and url link from function_a back to the second rule

def function_b(self, response):
    # receives responses from the second rule
    # grab links the same way as function_a
    ...

def function_c(self, links):
    # does process_links in the rule send the links it received to function_c?
    ...
Can this be done recursively to deep-crawl a single site? I'm not sure I have the rules concept right. Do I have to add X rules to process pages X levels deep, or is there a better way to handle recursive depth crawls?
Thanks
Upvotes: 3
Views: 1072
Reputation: 976
From the docs, the following passage implies that every rule is applied to every page (my italics):
rules
Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rules objects are described below. If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.
In your case, target each rule at the appropriate page type and then order the rules in depth order, as in the sketch below.
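A minimal sketch of that layout, assuming a CrawlSpider (the spider name, start URL and XPaths are placeholders, not taken from your code): each LinkExtractor is restricted to the part of the page it should handle, the rules are listed in depth order, and follow=True keeps extracting matching links at any depth, so you do not need one rule per level.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DeepCrawlSpider(CrawlSpider):
    name = 'deep_crawl'
    start_urls = ['http://example.com/']  # placeholder

    rules = (
        # Depth 1: category links on the main page.
        Rule(
            LinkExtractor(restrict_xpaths='//div[@id="categories"]'),  # placeholder XPath
            callback='parse_category',
            follow=True,
        ),
        # Depth 2 and deeper: item links inside a category page. process_links
        # is called with the extracted Link objects before requests are made.
        Rule(
            LinkExtractor(restrict_xpaths='//div[@class="items"]'),  # placeholder XPath
            callback='parse_item',
            process_links='filter_links',
            follow=True,
        ),
    )

    def parse_category(self, response):
        # Yield data from a category page; matching links are followed by the
        # rules automatically, so no manual Request is needed here.
        yield {'category': response.xpath('//h1/text()').extract_first()}

    def filter_links(self, links):
        # process_links receives the list of Link objects a rule extracted;
        # return the ones you want requests made for.
        return [link for link in links if 'ignore' not in link.url]

    def parse_item(self, response):
        yield {'name': response.xpath('//h2/text()').extract_first()}

Because every response the CrawlSpider fetches through its rules is matched against the rules again, a single rule with follow=True already crawls recursively; extra rules are only needed for pages that require a different extractor or callback.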
Upvotes: 1