Dmitrii Mikhailov

Reputation: 5221

Chaining requests with scrapy

I can see that Scrapy downloads all pages concurrently, but what I need is to chain the people and extract_person methods: when I get a list of person URLs in the people method, I want to follow all of them and scrape the info I need, and only after that move on to the next page of person URLs. How can I do that?

def people(self, response):
    sel = Selector(response)
    urls = sel.xpath(XPATHS.URLS).extract()
    for url in urls:
        yield Request(
            url=BASE_URL + url,
            callback=self.extract_person,
        )

def extract_person(self, response):
    sel = Selector(response)
    name = sel.xpath(XPATHS.NAME).extract()[0]
    person = PersonItem(name=name)
    yield person

Upvotes: 3

Views: 1683

Answers (1)

alecxe

Reputation: 473863

You can control the priority of the requests:

priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low-priority.

Setting the priority for person requests to 1 will let Scrapy know to process them first:

for url in urls:
    yield Request(
        url=BASE_URL + url,
        callback=self.extract_person,
        priority=1,
    )
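
For completeness, here is a minimal self-contained spider sketch built around this approach. The spider name, start URL, XPath expressions, and item field are placeholders made up for illustration, not taken from the question, and it assumes Scrapy >= 1.0 for response.urljoin and extract_first:

import scrapy
from scrapy import Field, Item, Request


class PersonItem(Item):
    name = Field()


class PeopleSpider(scrapy.Spider):
    # Hypothetical spider: the name, start URL, and XPaths are assumptions.
    name = "people"
    start_urls = ["http://example.com/people?page=1"]

    def parse(self, response):
        # Person pages get priority=1, so the scheduler pops them
        # before the next listing page, which keeps the default 0.
        for href in response.xpath("//a[@class='person']/@href").extract():
            yield Request(
                response.urljoin(href),
                callback=self.extract_person,
                priority=1,
            )

        # The pagination request is queued at the default priority
        # and is therefore scheduled after the person pages above.
        next_page = response.xpath("//a[@rel='next']/@href").extract_first()
        if next_page:
            yield Request(response.urljoin(next_page), callback=self.parse)

    def extract_person(self, response):
        name = response.xpath("//h1/text()").extract_first()
        yield PersonItem(name=name)

One caveat: priority only changes the order in which the scheduler hands requests to the downloader. With several requests in flight concurrently, it does not strictly guarantee that every person page finishes before the next listing page is fetched.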

Upvotes: 3
