Reputation: 5221
Now I can see that Scrapy downloads all pages concurrently, but what I need is to chain the people and extract_person methods, so that when I get a list of person URLs in the people method, I follow all of them and scrape all the info I need, and only after that continue with the next page of people URLs. How can I do that?
def people(self, response):
    sel = Selector(response)
    urls = sel.xpath(XPATHS.URLS).extract()
    for url in urls:
        yield Request(
            url=BASE_URL + url,
            callback=self.extract_person,
        )

def extract_person(self, response):
    sel = Selector(response)
    name = sel.xpath(XPATHS.NAME).extract()[0]
    person = PersonItem(name=name)
    yield person
Upvotes: 3
Views: 1683
Reputation: 473863
You can control the priority of the requests:
priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.
Setting the priority of the person requests to 1 tells Scrapy to process them first:
for url in urls:
    yield Request(
        url=BASE_URL + url,
        callback=self.extract_person,
        priority=1,
    )
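For context, here is a minimal sketch of how the two callbacks could fit together under this approach (BASE_URL, the XPATHS selectors, and the next-page link are placeholders assumed for illustration): the person requests carry priority=1, while the request for the next listing page keeps the default priority of 0, so the scheduler dequeues all pending person requests before moving on:

    from scrapy import Spider
    from scrapy.http import Request
    from scrapy.selector import Selector
    from scrapy.item import Item, Field

    BASE_URL = 'http://example.com'  # placeholder domain

    class PersonItem(Item):
        name = Field()

    class XPATHS:
        # placeholder selectors standing in for the question's constants
        URLS = '//a[@class="person"]/@href'
        NAME = '//h1/text()'
        NEXT_PAGE = '//a[@rel="next"]/@href'

    class PeopleSpider(Spider):
        name = 'people'

        def start_requests(self):
            yield Request(BASE_URL + '/people', callback=self.people)

        def people(self, response):
            sel = Selector(response)
            # person pages get priority=1, so the scheduler dequeues them
            # before the next listing page (default priority 0)
            for url in sel.xpath(XPATHS.URLS).extract():
                yield Request(
                    url=BASE_URL + url,
                    callback=self.extract_person,
                    priority=1,
                )
            # pagination request at default priority, processed afterwards
            next_page = sel.xpath(XPATHS.NEXT_PAGE).extract()
            if next_page:
                yield Request(url=BASE_URL + next_page[0], callback=self.people)

        def extract_person(self, response):
            sel = Selector(response)
            name = sel.xpath(XPATHS.NAME).extract()[0]
            yield PersonItem(name=name)

Keep in mind that Scrapy still downloads several requests concurrently (see the CONCURRENT_REQUESTS setting), so priority controls the order in which requests are scheduled rather than making the crawl strictly sequential.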
Upvotes: 3