kakarukeys

Reputation: 23551

How to yield item only after all links have been followed in Scrapy?

The original code:

import scrapy

class HomepageSpider(scrapy.Spider):
    name = 'homepage_spider'

    def start_requests(self):
        ...

    def parse(self, response):
        # harvest some data from response
        item = ...

        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            cb_kwargs={"item": item}
        )

    def parse_details(self, response, item):
        # harvest details
        ...
        yield item

This is the standard way to follow links on a page. However, it has a flaw: if there is an HTTP error (e.g. 503) or a connection error while following the second URL, parse_details is never called and yield item is never executed, so all the data harvested in parse is lost.

Changed code:

class HomepageSpider(scrapy.Spider):
    name = 'homepage_spider'

    def start_requests(self):
        ...

    def parse(self, response):
        # harvest some data from response
        item = ...

        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            cb_kwargs={"item": item}
        )
        yield item

    def parse_details(self, response, item):
        # harvest details
        ...

The changed code does not work: yield item is executed immediately, before parse_details has run (perhaps due to the Twisted framework; this behavior is different from what one would expect with the asyncio library), so the item is always yielded with incomplete data.
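Presumably this is because parse is just a generator: the engine iterates it straight away, so the Request and the item are both collected before the detail page has even been downloaded. A minimal plain-Python sketch (no Scrapy involved, the names are only illustrative) of that ordering:

def parse_like_callback():
    # stands in for HomepageSpider.parse: the engine simply iterates this generator
    item = {"title": "partial data"}          # harvested from the first page
    yield "Request(https://detail-page)"      # handed to the scheduler, not awaited
    yield item                                # reached immediately, item still incomplete

print(list(parse_like_callback()))
# ['Request(https://detail-page)', {'title': 'partial data'}]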

How can I make sure yield item is executed only after all links have been followed, regardless of success or failure? Is something like

    res1 = scrapy.Request(...)
    res2 = scrapy.Request(...)

    yield scrapy.join([res1, res2])  # block until both urls are followed?
    yield item

possible?
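For what it's worth, the closest workaround I can sketch without such a join is to chain the detail requests one after another, carrying the item in cb_kwargs and yielding it only from the last step; the spider name, URLs and fields below are placeholders:

import scrapy

class ChainedSpider(scrapy.Spider):
    # hypothetical sketch: follow the detail links one after another, carrying
    # the item along in cb_kwargs, and yield it only once the list of pending
    # URLs is empty, whether each request succeeded or failed
    name = "chained_spider"
    start_urls = ["https://homepage"]  # placeholder

    def parse(self, response):
        item = {"url": response.url}  # harvest some data from response
        pending = ["https://detail-page-1", "https://detail-page-2"]  # placeholders
        yield from self.follow_next(item, pending)

    def follow_next(self, item, pending):
        if not pending:
            yield item  # every link has been attempted, successfully or not
            return
        url = pending.pop(0)
        yield scrapy.Request(
            url,
            callback=self.parse_details,
            cb_kwargs={"item": item, "pending": pending},
            errback=lambda failure, item=item, pending=pending: self.skip_and_continue(
                failure, item, pending
            ),
        )

    def parse_details(self, response, item, pending):
        item.setdefault("details", []).append(response.url)  # harvest details
        yield from self.follow_next(item, pending)

    def skip_and_continue(self, failure, item, pending):
        # the failed link contributes nothing; move on to the next one
        yield from self.follow_next(item, pending)

This serializes the detail requests for a given item, though, so it trades away some of Scrapy's concurrency.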

Upvotes: 1

Views: 394

Answers (1)

wishmaster

Reputation: 1487

You can send the failed request to an errback function (whenever an error happens) and yield the item from there.

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

class HomepageSpider(scrapy.Spider):
    name = 'homepage_spider'

    def start_requests(self):
        ...

    def parse(self, response):
        # harvest some data from response
        item = ...

        yield scrapy.Request(
            "https://detail-page",
            callback=self.parse_details,
            meta={"item": item},
            errback=self.my_handle_error
        )

    def parse_details(self, response):
        item = response.meta['item']
        # harvest details
        ...
        yield item

    def my_handle_error(self, failure):
        # Scrapy calls the errback with the Failure only; recover the original
        # request (and the item stored in its meta) from the failure
        request = getattr(failure, "request", None) or failure.value.response.request
        print(f"Error on {request.url}")
        # you can do much deeper error checking here to see what type of
        # failure it was: DNSLookupError, TimeoutError, HttpError, ...
        yield request.meta["item"]
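With this in place the item is yielded exactly once per homepage: from parse_details when the detail request succeeds, or from my_handle_error when it fails, so the data harvested in parse is no longer lost.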

Second edit, passing the item to the errback explicitly so it can be yielded from there:

yield scrapy.Request(
    "https://detail-page",
    callback=self.parse_details,
    cb_kwargs={"item": item},
    errback=lambda failure, item=item: self.my_handle_error(failure, item)
)

def my_handle_error(self, failure, item):
    # the item is still yielded even though the detail request failed
    yield item

Upvotes: 1
