Reputation: 706
I just started using Scrapy and I am trying to scrape several links which together yield my JSON result. Simple enough, but my problem is with the asynchronous nature of the requests: I am having trouble figuring out the proper structure to achieve this.
Everything works well in the following code except the yield items
at the end of the parse method. This value is yielded before any/all of the requests are complete. Is there a way to say "wait for all requests to be completed" and then yield? Or an "on finish scraping" method where I can retrieve the final result?
import scrapy

# MyItem, Category and SubItem are item classes defined elsewhere (e.g. in items.py)

class SpiderCrawler(scrapy.Spider):
    name = "spiderman"
    allowed_domains = ["mywebsite.com"]
    start_urls = [
        "https://www.mywebsite.com/items",
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@id="col"]'):
            items = MyItem()
            items['categories'] = []
            sections = sel.xpath('.//tbody')
            category_count = 5  # filler
            for count in range(1, category_count):
                category = Category()
                # set categories
                category['sub_items'] = []
                for item, link in zip(sections.xpath('text()'), sections.xpath('@href')):
                    subItem = SubItem()
                    # set subItems
                    subItem['link'] = "https://www.mywebsite.com/nexturl"  # Scrapy needs an absolute URL with a scheme
                    # the problem
                    request = scrapy.Request(subItem['link'], callback=self.parse_sub_item)
                    request.meta['sub_item'] = subItem
                    yield request
                    category['sub_items'].append(subItem)
                items['categories'].append(category)
            # I want this yield to not be executed until ALL requests are complete
            yield items

    def parse_sub_item(self, response):
        fields = ...  # some xpath
        subItem = response.meta["sub_item"]
        subItem['fields'] = ...  # some xpath
        subItem['another_field'] = ...  # some xpath
Upvotes: 2
Views: 5389
Reputation: 3691
The idea behind Scrapy is to export items as each request completes. What you are trying to do is gather everything together and return only one item, and that is not possible this way.
However, you can achieve what you want with a little alteration to your code.
Export the items as they currently are and create, for example, an item pipeline that converts the items you yield in the parse
method into one big item (a dictionary?) containing the categories and their sub_items,
and exports everything together when the close_spider
method is called.
In this case you can handle asynchronous item processing and group your results together.
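A minimal sketch of what such a pipeline could look like, assuming the names CollectItemsPipeline and result.json as placeholders (the grouping logic would depend on your item structure):

import json

class CollectItemsPipeline(object):
    def open_spider(self, spider):
        self.grouped = []

    def process_item(self, item, spider):
        # Called once per item as its request completes; just store it.
        self.grouped.append(dict(item))
        return item

    def close_spider(self, spider):
        # By the time this runs, all requests have finished, so the
        # combined result can be written out in one piece.
        with open('result.json', 'w') as f:
            json.dump(self.grouped, f)

You would then enable the pipeline in settings.py, for example ITEM_PIPELINES = {'myproject.pipelines.CollectItemsPipeline': 300}, and yield the sub-items from parse_sub_item instead of assembling them in parse.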
Upvotes: 2