Reputation: 4575
I have a scrapy process which successfully parses items and sub-items, but I can't see whether there's a final hook which would allow me to transform the final data result after everything has been parsed, but before it is formatted as output.
My spider is doing something like this:
class MySpider(scrapy.Spider):
    def parse(self, response, **kwargs):
        for part in [1, 2, 3]:
            url = f'{response.request.url}?part={part}'
            yield scrapy.Request(url=url, callback=self.parse_part, meta={'part': part})

    def parse_part(self, response, **kwargs):
        # ...
        for subpart in part:
            yield {
                'title': self.get_title(subpart),
                'tag': self.get_tag(subpart),
            }
This works well, but I haven't been able to figure out where I can take the complete resulting structure and transform it before outputting it to JSON (or whatever). I thought maybe I could do this in a spider middleware's process_spider_output method, but that only seems to give me individual items, not the final structure.
Upvotes: 0
Views: 168
Reputation: 3720
You can use this method to do something after the spider has closed:
def spider_closed(self):
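For reference, such a spider_closed callback is typically connected through Scrapy's signals API. A minimal sketch along the lines of the Scrapy docs (the spider name is just a placeholder):

import scrapy
from scrapy import signals

class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Build the spider as usual, then connect the callback to the signal.
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)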
However, you won't be able to modify items there. To modify items you need to write a custom pipeline. In the pipeline you write a process_item method which gets called every time your spider yields an item, so in that method you can save all items to a list and then transform the whole list in the pipeline's close_spider method.
Read the Scrapy docs on item pipelines (https://docs.scrapy.org/en/latest/topics/item-pipeline.html) for how to write your own pipeline.
Example
Let's say you want to have all your items as JSON, maybe to send a request to an API. You have to activate your pipeline in settings.py for it to be used.
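For example (the module path myproject.pipelines is an assumption; use your own project's dotted path):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,  # the number (0-1000) sets the run order
}

And the pipeline itself: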
import json

class MyPipeline:
    def __init__(self, *args, **kwargs):
        self.items = []

    def process_item(self, item, spider):
        # Called once for every item the spider yields; collect them for later.
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # In this method you can iterate over self.items and transform them to your preference.
        json_data = json.dumps(self.items)
        print(json_data)
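Since the original question is about reshaping the complete result, close_spider is also the natural place to do that. A minimal sketch, assuming you want to group the collected titles by their 'tag' field and write the result to a file (the filename and the grouping are just illustrations):

import json
from collections import defaultdict

class GroupingPipeline:
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Reshape the flat item list into the final structure before output.
        grouped = defaultdict(list)
        for item in self.items:
            grouped[item['tag']].append(item['title'])
        with open('output.json', 'w') as f:
            json.dump(grouped, f, indent=2)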
Upvotes: 1