kakarukeys

Reputation: 23551

How to yield item only after all links have been followed in Scrapy?

The original code:

import scrapy

class HomepageSpider(scrapy.Spider):
    name = 'homepage_spider'

    def start_requests(self):
        ...

    def parse(self, response):
        # harvest some data from response
        item = ...

        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            cb_kwargs={"item": item}
        )

    def parse_details(self, response, item):
        # harvest details
        ...
        yield item

This is the standard way to follow links on a page. However, it has a flaw: if there is an HTTP error (e.g. 503) or a connection error while following the second URL, parse_details is never called and yield item is never executed, so all the data harvested in parse is lost.

Changed code:

class HomepageSpider(scrapy.Spider):
    name = 'homepage_spider'

    def start_requests(self):
        ...

    def parse(self, response):
        # harvest some data from response
        item = ...

        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            cb_kwargs={"item": item}
        )
        yield item

    def parse_details(self, response, item):
        # harvest details
        ...

The changed code does not work: yield item is executed immediately, before parse_details has run (perhaps due to the Twisted framework; this behavior is different from what one would expect with the asyncio library), so the item is always yielded with incomplete data.
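Presumably this is because parse is just a generator: the engine iterates it straight away, so the Request and the item are both collected before the detail page has even been downloaded. A minimal plain-Python sketch (no Scrapy involved, the names are only illustrative) of that ordering:

def parse_like_callback():
    # stands in for HomepageSpider.parse: the engine simply iterates this generator
    item = {"title": "partial data"}          # harvested from the first page
    yield "Request(https://detail-page)"      # handed to the scheduler, not awaited
    yield item                                # reached immediately, item still incomplete

print(list(parse_like_callback()))
# ['Request(https://detail-page)', {'title': 'partial data'}]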

How can I make sure yield item is executed only after all links have been followed, regardless of success or failure? Is something like

    res1 = scrapy.Request(...)
    res2 = scrapy.Request(...)

    yield scrapy.join([res1, res2])  # block until both urls are followed?
    yield item

possible?
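For what it's worth, the closest workaround I can sketch without such a join is to chain the detail requests one after another, carrying the item in cb_kwargs and yielding it only from the last step; the spider name, URLs and fields below are placeholders:

import scrapy

class ChainedSpider(scrapy.Spider):
    # hypothetical sketch: follow the detail links one after another, carrying
    # the item along in cb_kwargs, and yield it only once the list of pending
    # URLs is empty, whether each request succeeded or failed
    name = "chained_spider"
    start_urls = ["https://homepage"]  # placeholder

    def parse(self, response):
        item = {"url": response.url}  # harvest some data from response
        pending = ["https://detail-page-1", "https://detail-page-2"]  # placeholders
        yield from self.follow_next(item, pending)

    def follow_next(self, item, pending):
        if not pending:
            yield item  # every link has been attempted, successfully or not
            return
        url = pending.pop(0)
        yield scrapy.Request(
            url,
            callback=self.parse_details,
            cb_kwargs={"item": item, "pending": pending},
            errback=lambda failure, item=item, pending=pending: self.skip_and_continue(
                failure, item, pending
            ),
        )

    def parse_details(self, response, item, pending):
        item.setdefault("details", []).append(response.url)  # harvest details
        yield from self.follow_next(item, pending)

    def skip_and_continue(self, failure, item, pending):
        # the failed link contributes nothing; move on to the next one
        yield from self.follow_next(item, pending)

This serializes the detail requests for a given item, though, so it trades away some of Scrapy's concurrency.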

Upvotes: 1

Views: 394

Answers (1)

wishmaster

Reputation: 1487

You can send the failed request to an errback function (whenever an error happens) and yield the item from there.

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

class HomepageSpider(scrapy.Spider):
    name = 'homepage_spider'

    def start_requests(self):
        ...

    def parse(self, response):
        # harvest some data from response
        item = ...

        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            meta={"item": item},
            errback=self.my_handle_error
        )

    def parse_details(self, response):
        item = response.meta['item']
        # harvest details
        ...
        yield item

    def my_handle_error(self, failure):
        # Scrapy calls the errback with the Failure only; recover the original
        # request (and the item stored in its meta) from the failure
        request = getattr(failure, "request", None) or failure.value.response.request
        print(f"Error on {request.url}")
        # you can do much deeper error checking here to see what type of
        # failure it was: DNSLookupError, TimeoutError, HttpError, ...
        yield request.meta["item"]
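With this in place the item is yielded exactly once per homepage: from parse_details when the detail request succeeds, or from my_handle_error when it fails, so the data harvested in parse is no longer lost.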

Second edit, passing the item to the errback explicitly so it can be yielded from there:

yield scrapy.Request(
    "https://detail-page",
    callback=self.parse_details,
    cb_kwargs={"item": item},
    errback=lambda failure, item=item: self.my_handle_error(failure, item)
)

def my_handle_error(self, failure, item):
    # the item is still yielded even though the detail request failed
    yield item

Upvotes: 1
