Reputation: 7294
This is not an issue for me; I can live without it, but I am just curious whether it is possible and how.
Today I learned that scrapy.Request
calls will not finish in the same order in which they are started.
Pseudocode example:
import scrapy

class SomeSpider(scrapy.Spider):
    def parse(self, response):
        # get all ads (25) from the ads list
        for ad in adList():
            add_url = findAddUrl()
            yield scrapy.Request(add_url, callback=self.parseAd)
        # go to the next page
        if some_condition_OK:
            next_page_url = findNextpageUrl()
            yield scrapy.Request(next_page_url)
        else:
            print('Stopped at.')

    def parseAd(self, response):
        field_1 = get_field_1()
        field_n = get_field_n()
        # save field_1 to field_n to the sqlite DB
This is a simplified example of a spider I have coded, and it works fine.
But what I learned today is that yield scrapy.Request
will not finish in the same order it is started.
In my example there are 25 ads on each page, and for each ad I start yield scrapy.Request(add_url, callback=self.parseAd)
to get more information about it.
After that, I go to the next page with yield scrapy.Request(next_page_url).
But I have noticed that some ads from page 2 will finish before all the ads from page 1 have finished.
I understand why this happens and I see the benefit of this approach.
But my question is: is it possible to make scrapy.Request
deterministic?
By deterministic I mean that each scrapy.Request
will finish in the same order in which it was started.
Upvotes: 1
Views: 155
Reputation: 1981
The only way to make Scrapy deterministic is to yield only one request at a time, while keeping the rest of them in a list or queue:
import scrapy

class SomeSpider(scrapy.Spider):
    # queue of requests waiting to be dispatched one at a time
    pending_request = []

    def parse(self, response):
        # get all ads (25) from the ads list and queue a request for each
        for ad in adList():
            add_url = findAddUrl()
            self.pending_request.append(
                scrapy.Request(add_url, callback=self.parseAd))
        # queue the request for the next page
        if some_condition_OK:
            next_page_url = findNextpageUrl()
            self.pending_request.append(scrapy.Request(next_page_url))
        else:
            print('Stopped at.')
        # dispatch the next queued request, if any
        if self.pending_request:
            yield self.pending_request.pop(0)

    def parseAd(self, response):
        field_1 = get_field_1()
        field_n = get_field_n()
        # only once this ad is parsed, dispatch the next queued request
        if self.pending_request:
            yield self.pending_request.pop(0)
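As a side note, a settings-based alternative is to allow only one request in flight and switch the scheduler queues to FIFO, so requests are dispatched in the order they are yielded. This is a minimal sketch, assuming Scrapy 1.x or later; the queue class paths come from the Scrapy FAQ on breadth-first crawling, and the spider name is hypothetical:

import scrapy

class OrderedSpider(scrapy.Spider):
    name = 'ordered'  # hypothetical name, for illustration only
    # per-spider settings: one request at a time, FIFO scheduling
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,  # never download two pages in parallel
        'DEPTH_PRIORITY': 1,       # prefer breadth-first over depth-first order
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }

Note that this removes all download concurrency, so the crawl will be considerably slower than a normal concurrent crawl.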
Upvotes: 1
Reputation: 32
Add this setting (the default DOWNLOAD_DELAY is 0):
DOWNLOAD_DELAY = 0.25  # 250 ms of delay
But Scrapy also has a feature to set download delays automatically, called AutoThrottle. It sets delays based on the load of both the Scrapy server and the website you are crawling, and this works better than setting an arbitrary delay.
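A minimal sketch of enabling AutoThrottle in settings.py, using the setting names from the Scrapy docs (the values here are illustrative and should be tuned for your target site):

# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5   # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60    # maximum delay to back off to when latency is high
AUTOTHROTTLE_DEBUG = False     # set to True to log throttling stats for every response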
Upvotes: 0