Reputation: 21
I am reading Learning Scrapy
by Dimitrios Kouzis-Loukas. I have a question about the "Two-direction crawling with a spider"
part in chapter 3, page 58.
The original code is as follows:
import urlparse  # Python 2; the book's examples target Python 2
from scrapy.http import Request

def parse(self, response):
    # Get the next index URLs and yield Requests
    next_selector = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))
    # Get item URLs and yield Requests
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)
But from my understanding, shouldn't the second loop be nested inside the first one, so that we first download an index page, then download all the item pages it links to, and only after that move on to the next index page?
So I just want to know the order in which the original code operates. Please help!
Upvotes: 1
Views: 79
Reputation: 28216
You can't really merge the two loops. The Request objects yielded in them have different callbacks. The first one will be processed by the parse method (which seems to be parsing a listing of multiple items), and the second by the parse_item method (probably parsing the details of a single item).
As for the order of scraping, Scrapy (by default) uses a LIFO queue, which means the last request created will be processed first.
However, due to the asynchronous nature of Scrapy, it's impossible to say what the exact order will be.
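To see roughly what LIFO scheduling implies here, consider this standard-library sketch (not Scrapy internals, and the page names are made up): each index page yields its next-index request first and its item requests afterwards, exactly like the two loops in the book's parse method.

```python
from collections import deque

# Hypothetical site: each index page links to the next index page
# and to its item pages (all names are invented for illustration).
PAGES = {
    "index1": {"next": ["index2"], "items": ["item-a1", "item-a2"]},
    "index2": {"next": [], "items": ["item-b1", "item-b2"]},
}

def crawl(lifo=True):
    """Simulate a scheduler: requests yielded while parsing a page are
    pushed onto a queue; LIFO pops the most recently pushed one first."""
    queue = deque(["index1"])
    order = []
    while queue:
        url = queue.pop() if lifo else queue.popleft()
        order.append(url)
        page = PAGES.get(url)
        if page:  # only index pages yield further requests
            for nxt in page["next"]:    # first loop: next-index URLs
                queue.append(nxt)
            for item in page["items"]:  # second loop: item URLs
                queue.append(item)
    return order

print(crawl(lifo=True))
# LIFO visits index1's items before moving on to index2
print(crawl(lifo=False))
# FIFO drains all index pages before any items
```

So with the default LIFO behavior you already get, approximately, the order you wanted: the item pages of the current index page tend to be fetched before the next index page (modulo Scrapy's concurrency). If you ever need strict breadth-first order instead, the Scrapy FAQ documents switching to FIFO queues via the DEPTH_PRIORITY, SCHEDULER_DISK_QUEUE, and SCHEDULER_MEMORY_QUEUE settings.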
Upvotes: 1