Reputation: 2415
I need to send my requests in order with Scrapy.
def n1(self, response) :
#self.input = [elem1,elem2,elem3,elem4,elem5, .... ,elem100000]
for (elem,) in self.input :
link = urljoin(path,elem)
yield Request(link)
My problem is that the requests are not in the order. I read this question but it has no correct answer.
How should I change my code for sending the requests in order?
UPDATE 1
I used priority and changed my code to
def n1(self, response) :
#self.input = [elem1,elem2,elem3,elem4,elem5, .... ,elem100000]
self.prio = len(self.input)
for (elem,) in self.input :
self.prio -= 1
link = urljoin(path,elem)
yield Request(link, priority=self.prio)
And my setting for this spider is
custom_settings = {
'DOWNLOAD_DELAY' : 0,
'COOKIES_ENABLED' : True,
'CONCURRENT_REQUESTS' : 1 ,
'AUTOTHROTTLE_ENABLED' : False,
}
Now the order is changed, but it's not in the order of elements in the array
Upvotes: 0
Views: 587
Reputation: 1879
Use a return
statement instead of yield
.
You don't even need to touch any setting:
from scrapy.spiders import Spider, Request
class MySpider(Spider):
name = 'toscrape.com'
start_urls = ['http://books.toscrape.com/catalogue/page-1.html']
urls = (
'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
)
def parse(self, response):
for url in self.urls:
return Request(url)
Output:
2018-11-20 03:35:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-1.html> (referer: http://books.toscrape.com/catalogue/page-1.html)
2018-11-20 03:35:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-2.html> (referer: http://books.toscrape.com/catalogue/page-1.html)
2018-11-20 03:35:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-3.html> (referer: http://books.toscrape.com/catalogue/page-2.html)
2018-11-20 03:35:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-4.html> (referer: http://books.toscrape.com/catalogue/page-3.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-5.html> (referer: http://books.toscrape.com/catalogue/page-4.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-6.html> (referer: http://books.toscrape.com/catalogue/page-5.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-7.html> (referer: http://books.toscrape.com/catalogue/page-6.html)
2018-11-20 03:35:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-8.html> (referer: http://books.toscrape.com/catalogue/page-7.html)
2018-11-20 03:35:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-9.html> (referer: http://books.toscrape.com/catalogue/page-8.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-10.html> (referer: http://books.toscrape.com/catalogue/page-9.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-11.html> (referer: http://books.toscrape.com/catalogue/page-10.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-12.html> (referer: http://books.toscrape.com/catalogue/page-11.html)
2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-13.html> (referer: http://books.toscrape.com/catalogue/page-12.html)
2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-14.html> (referer: http://books.toscrape.com/catalogue/page-13.html)
2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-15.html> (referer: http://books.toscrape.com/catalogue/page-14.html)
2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-16.html> (referer: http://books.toscrape.com/catalogue/page-15.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-17.html> (referer: http://books.toscrape.com/catalogue/page-16.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-18.html> (referer: http://books.toscrape.com/catalogue/page-17.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-19.html> (referer: http://books.toscrape.com/catalogue/page-18.html)
2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-20.html> (referer: http://books.toscrape.com/catalogue/page-19.html)
2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-21.html> (referer: http://books.toscrape.com/catalogue/page-20.html)
2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-22.html> (referer: http://books.toscrape.com/catalogue/page-21.html)
2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-23.html> (referer: http://books.toscrape.com/catalogue/page-22.html)
2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-24.html> (referer: http://books.toscrape.com/catalogue/page-23.html)
2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-25.html> (referer: http://books.toscrape.com/catalogue/page-24.html)
With a yield
statement, the engine gets all the responses from the generator and executes them in an arbitrary order (I suspect they might be stored in some sort of set to remove duplicates).
Upvotes: 2
Reputation: 3280
You can send the next request only after receiving the previous one:
class MainSpider(Spider):
urls = [
'https://www.url1...',
'https://www.url2...',
'https://www.url3...',
]
def start_requests(self):
yield Request(
url=self.urls[0],
callback=self.parse,
meta={'next_index': 1},
)
def parse(self, response):
next_index = response.meta['next_index']
# do something with response...
# Process next url
if next_index < len(self.urls):
yield Request(
url=self.urls[next_index],
callback=self.parse,
meta={'next_index': next_index+1},
)
Upvotes: 0
Reputation: 9185
I think concurrent request is play at here. You can try setting
custom_settings = {
'CONCURRENT_REQUESTS': 1
}
Default setting is 8. It will kind of explain why priority will not honoured when you have other workers free for work.
Upvotes: 0