WebOrCode

Reputation: 7294

Making scrapy.Request deterministic?

This is not an issue for me; I can live without it. But I am just curious whether it is possible, and how.

Today I learned that scrapy.Request calls will not finish in the same order as they are started.

Pseudocode example:

import scrapy


class SomeSpider(scrapy.Spider):

    def parse(self, response):

        # get all ads (25) from the ads list
        for ad in adList():
            ad_url = findAdUrl()
            yield scrapy.Request(ad_url, callback=self.parseAd)

        # go to the next page
        if some_condition_OK:
            next_page_url = findNextPageUrl()
            yield scrapy.Request(next_page_url)
        else:
            print('Stopped.')

    def parseAd(self, response):
        field_1 = get_field_1()
        field_n = get_field_n()

        # save field_1 to field_n to a sqlite DB
This is a simplified example of the spider I have coded, and it is working fine.

But what I have learned today is that requests yielded with scrapy.Request will not finish in the same order as they are started.

In my example, there are 25 ads on each page, and I yield scrapy.Request(ad_url, callback=self.parseAd) to get more information about each ad.
After that, I go to the next page with yield scrapy.Request(next_page_url).
But I have noticed that some ads from page 2 finish before all the ads from page 1 have finished.
I understand why, and I see the benefit of this approach.

But my question is: is it possible to make scrapy.Request deterministic?

What I mean by deterministic is that the requests finish in the same order they were started.

Upvotes: 1

Views: 155

Answers (2)

VMRuiz

Reputation: 1981

The only way to make Scrapy deterministic is to yield only one request at a time, while keeping the rest of them in a list or queue:

import scrapy


class SomeSpider(scrapy.Spider):

    # FIFO queue of requests waiting to be released one by one
    pending_requests = []

    def parse(self, response):

        # queue all ads (25) from the ads list instead of yielding them
        for ad in adList():
            ad_url = findAdUrl()
            self.pending_requests.append(
                scrapy.Request(ad_url, callback=self.parseAd))

        # queue the next page
        if some_condition_OK:
            next_page_url = findNextPageUrl()
            self.pending_requests.append(scrapy.Request(next_page_url))
        else:
            print('Stopped.')

        # hand exactly one request to the scheduler
        if self.pending_requests:
            yield self.pending_requests.pop(0)

    def parseAd(self, response):
        field_1 = get_field_1()
        field_n = get_field_n()

        # this ad is done, so release the next queued request
        if self.pending_requests:
            yield self.pending_requests.pop(0)
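
A note on the design choice: one might hope that simply capping concurrency in settings.py would achieve the same thing, but that alone is not enough, because Scrapy's default scheduler pops pending requests in LIFO (depth-first) order. A minimal sketch for comparison (CONCURRENT_REQUESTS is a standard Scrapy setting):

# settings.py
CONCURRENT_REQUESTS = 1  # one download at a time, but pending requests
                         # are still popped LIFO, so order is not FIFO

That is why the spider above keeps its own FIFO list and hands the scheduler exactly one request at a time.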

Upvotes: 1

Svickie7

Reputation: 32

Add this setting:

DOWNLOAD_DELAY (default: 0)

DOWNLOAD_DELAY = 0.25  # 250 ms of delay

But Scrapy also has a feature to set download delays automatically, called AutoThrottle. It adjusts delays based on the load of both the Scrapy server and the website you are crawling, and works better than setting an arbitrary delay.
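
For example, a minimal settings.py sketch (AUTOTHROTTLE_ENABLED, AUTOTHROTTLE_START_DELAY, and AUTOTHROTTLE_MAX_DELAY are standard Scrapy settings; the values here are only illustrative):

# settings.py

# fixed delay between consecutive requests, in seconds
DOWNLOAD_DELAY = 0.25  # 250 ms of delay

# or let AutoThrottle adapt the delay to server load instead
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0  # initial download delay
AUTOTHROTTLE_MAX_DELAY = 10.0   # upper bound when latency is high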

Upvotes: 0
