WebOrCode

Reputation: 7294

Making scrapy.Request deterministic?

This is not an issue for me; I can live without it. But I am just curious whether it is possible, and how.

Today I learned that scrapy.Request calls will not finish in the same order as they are started.

Pseudocode example:

import scrapy


class SomeSpider(scrapy.Spider):

    def parse(self, response):

        # get all ads (25) from the ads list
        for ad in adList():
            ad_url = findAdUrl()
            yield scrapy.Request(ad_url, callback=self.parseAd)

        # go to the next page
        if some_condition_OK:
            next_page_url = findNextPageUrl()
            yield scrapy.Request(next_page_url)
        else:
            print('Stopped.')

    def parseAd(self, response):
        field_1 = get_field_1()
        field_n = get_field_n()

        # save field_1 to field_n to a sqlite DB
This is a simplified example of the spider I have coded, and it is working fine.

But what I have learned today is that requests yielded with scrapy.Request will not finish in the same order as they are started.

In my example, there are 25 ads on each page, and I yield scrapy.Request(ad_url, callback=self.parseAd) to get more information about each ad.
After that, I go to the next page with yield scrapy.Request(next_page_url).
But I have noticed that some ads from page 2 finish before all the ads from page 1 have finished.
I understand why, and I see the benefit of this approach.

But my question is: is it possible to make scrapy.Request deterministic?

What I mean by deterministic is that the requests finish in the same order they were started.

Upvotes: 1

Views: 155

Answers (2)

VMRuiz

Reputation: 1981

The only way to make Scrapy deterministic is to yield only one request at a time, while keeping the rest of them in a list or queue:

import scrapy


class SomeSpider(scrapy.Spider):

    # FIFO queue of requests waiting to be released one by one
    pending_requests = []

    def parse(self, response):

        # queue all ads (25) from the ads list instead of yielding them
        for ad in adList():
            ad_url = findAdUrl()
            self.pending_requests.append(
                scrapy.Request(ad_url, callback=self.parseAd))

        # queue the next page
        if some_condition_OK:
            next_page_url = findNextPageUrl()
            self.pending_requests.append(scrapy.Request(next_page_url))
        else:
            print('Stopped.')

        # hand exactly one request to the scheduler
        if self.pending_requests:
            yield self.pending_requests.pop(0)

    def parseAd(self, response):
        field_1 = get_field_1()
        field_n = get_field_n()

        # this ad is done, so release the next queued request
        if self.pending_requests:
            yield self.pending_requests.pop(0)
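
A note on the design choice: one might hope that simply capping concurrency in settings.py would achieve the same thing, but that alone is not enough, because Scrapy's default scheduler pops pending requests in LIFO (depth-first) order. A minimal sketch for comparison (CONCURRENT_REQUESTS is a standard Scrapy setting):

# settings.py
CONCURRENT_REQUESTS = 1  # one download at a time, but pending requests
                         # are still popped LIFO, so order is not FIFO

That is why the spider above keeps its own FIFO list and hands the scheduler exactly one request at a time.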

Upvotes: 1

Svickie7

Reputation: 32

Add this setting:

DOWNLOAD_DELAY (default: 0)

DOWNLOAD_DELAY = 0.25  # 250 ms of delay

But Scrapy also has a feature to set download delays automatically, called AutoThrottle. It adjusts delays based on the load of both the Scrapy server and the website you are crawling, and works better than setting an arbitrary delay.
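
For example, a minimal settings.py sketch (AUTOTHROTTLE_ENABLED, AUTOTHROTTLE_START_DELAY, and AUTOTHROTTLE_MAX_DELAY are standard Scrapy settings; the values here are only illustrative):

# settings.py

# fixed delay between consecutive requests, in seconds
DOWNLOAD_DELAY = 0.25  # 250 ms of delay

# or let AutoThrottle adapt the delay to server load instead
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0  # initial download delay
AUTOTHROTTLE_MAX_DELAY = 10.0   # upper bound when latency is high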

Upvotes: 0
