David Pullar

Reputation: 706

Scrapy - Wait for ALL requests to be completed

I just started using Scrapy and I am trying to scrape several links which, when followed, yield my JSON result. Simple enough, but my problem is with the asynchronous nature of the requests: I am having trouble figuring out the proper structure to achieve this.

Everything works well in the following code except the yield items at the end of the parse method. This value is yielded before any/all of the requests are complete. Is there a way to say "wait for all requests to be completed" and then yield? Or an "on finish scraping" method where I can retrieve the final result?

import scrapy
# MyItem, Category and SubItem are the project's Item classes (imports omitted)

class SpiderCrawler(scrapy.Spider):
    name = "spiderman"
    allowed_domains = ["mywebsite.com"]
    start_urls = [
        "https://www.mywebsite.com/items",
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@id="col"]'):
            items = MyItem()
            items['categories'] = []

            sections = sel.xpath('//tbody')
            category_count = 5 #filler

            for count in range(1, category_count):
                category = Category()
                category['sub_items'] = []  # must exist before the append below
                # set categories
                for item, link in zip(sections.xpath("text()"), sections.xpath("@href")):
                    subItem = SubItem()
                    #set subItems
                    subItem['link'] = "https://www.mywebsite.com/nexturl"  # Request needs a full URL with scheme

                    #the problem
                    request = scrapy.Request(subItem['link'], callback=self.parse_sub_item)
                    request.meta['sub_item'] = subItem 
                    yield request

                    category['sub_items'].append(subItem)
                items['categories'].append(category)

        #I want this yield to not be executed until ALL requests are complete
        yield items

    def parse_sub_item(self, response):
        subItem = response.meta["sub_item"]
        subItem['fields'] = ...  # some xpath
        subItem['another_field'] = ...  # some xpath

Upvotes: 2

Views: 5389

Answers (1)

GHajba

Reputation: 3691

The idea behind Scrapy is to export items as each request completes. What you are trying to do is gather everything together and return only one item -- and that is not possible this way.

However, you can achieve what you want with a little alteration to your code. Export the items as they currently are, and create an item pipeline, for example, that collects the items yielded in the parse method into one big item (a dictionary?) containing the categories and their sub_items, then exports everything together when the close_spider method is called.

This way you can handle the asynchronous item processing and still group your results together.
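For illustration, a minimal sketch of such a pipeline; the class name CollectorPipeline, the result.json filename, and the flat list structure are assumptions, not part of the original answer:

import json

class CollectorPipeline(object):
    """Collects every item the spider yields and exports the whole
    group in one piece when the spider finishes."""

    def open_spider(self, spider):
        self.collected = []

    def process_item(self, item, spider):
        # Called once per yielded item, in whatever order the
        # asynchronous requests happen to complete.
        self.collected.append(dict(item))
        return item

    def close_spider(self, spider):
        # All requests are done by the time this runs, so the
        # grouped result can safely be written out together.
        with open('result.json', 'w') as f:
            json.dump(self.collected, f)

The pipeline has to be enabled in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.CollectorPipeline': 300} (adjust the module path to your project).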

Upvotes: 2
