Mantas Lukosevicius

Reputation: 2114

Scrapy/Python: run logic after yielded requests are finished

What I do:

def parse(self, response):

    product_urls = response.css('.product-item a::attr(href)').extract()

    for product_url in product_urls:
        yield Request(product_url, callback=self.parse_product)

    print("Continue doing stuff....")


def parse_product(self, response):
    title = response.css('h1::text').extract_first()
    print(title)

In this example, the code first prints Continue doing stuff.... and only after that prints the product titles. I would like it the other way around: first perform the requests and print the titles, and only then print Continue doing stuff....

UPDATE: @Georgiy asked in the comments whether I need the previously scraped product data.

The answer is yes; this is a simplified example. After the data is fetched, I want to manipulate it.

Upvotes: 2

Views: 279

Answers (2)

Aniketh Reddimi

Reputation: 23

Note: I can't comment due to lack of reputation.

While the code in the other answer works for most cases, I suggest using self.crawler.stats to keep track of the count, because manually decrementing a counter with many concurrent requests might introduce a race condition. Example code below.

    def parse(self, response):
        product_urls = response.css('.product-item a::attr(href)').extract()

        self.count = len(product_urls)
        self.crawler.stats.set_value('processed_product_pages', 0)
        if self.count == 0:
            self.onEnd()
        else:
            for product_url in product_urls:
                yield Request(product_url, callback=self.parse_product)

    def onEnd(self):
        print("Continue doing stuff....")

    def parse_product(self, response):
        title = response.css('h1::text').extract_first()
        print(title)
        self.crawler.stats.inc_value('processed_product_pages')
        if self.count == self.crawler.stats.get_value('processed_product_pages', 0):
            self.onEnd()

Upvotes: 0

napuzba

Reputation: 6298

You can move the logic to the parse_product function. For example:

    def parse(self, response):
        product_urls = response.css('.product-item a::attr(href)').extract()

        self.count = len(product_urls)
        if self.count == 0:
            self.onEnd()
        else:
            for product_url in product_urls:
                yield Request(product_url, callback=self.parse_product)

    def onEnd(self):
        print("Continue doing stuff....")

    def parse_product(self, response):
        title = response.css('h1::text').extract_first()
        print(title)
        self.count -= 1
        if self.count == 0:
            self.onEnd()

Upvotes: 3
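The counting technique in this answer doesn't depend on Scrapy itself, so it can be exercised with a small stand-alone simulation (the FakeSpider class and the titles below are invented for illustration). Callbacks may complete in any order, and the "continue" step still runs exactly once, after the last one:

```python
import random

class FakeSpider:
    """Toy stand-in for a spider; only the counter logic is real."""

    def __init__(self, urls):
        self.count = len(urls)  # one pending callback per URL
        self.log = []           # records the order things ran in

    def on_end(self):
        # Runs only when the last pending callback has finished.
        self.log.append("Continue doing stuff....")

    def parse_product(self, title):
        self.log.append(title)
        self.count -= 1
        if self.count == 0:
            self.on_end()

urls = [f"/product/{i}" for i in range(5)]
spider = FakeSpider(urls)

# Responses arrive in arbitrary order, as they would over the network.
titles = [f"Title {i}" for i in range(5)]
random.shuffle(titles)
for title in titles:
    spider.parse_product(title)

print(spider.log[-1])  # → Continue doing stuff....
```

Whatever order the five responses come back in, the log always ends with the "continue" entry, and it appears exactly once.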
