Reputation: 8501
I'm trying to collect a few pieces of information about a bunch of different web sites. I want to produce one Item per site that summarizes the information I found across that site, regardless of which page(s) I found it on.
I feel like this should be an item pipeline, like the duplicates filter example, except I need the final contents of the Item, not the results from the first page the crawler examined.
So I tried using request.meta to pass a single partially-filled Item through the various Requests for a given site. To make that work, I had to have my parse callback return exactly one new Request per call until it had no more pages to visit, then finally return the finished Item. That is a pain if I find multiple links I want to follow, and it breaks entirely if the scheduler throws away one of the requests due to a link cycle.
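Roughly, what I have now looks like this (a simplified sketch on a recent Scrapy; SiteItem, its fields, and the selectors are placeholders):
import scrapy

class SiteItem(scrapy.Item):
    emails = scrapy.Field()

class SiteSpider(scrapy.Spider):
    name = 'site'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Reuse the partially-filled item carried in request.meta, if any
        item = response.meta.get('item', SiteItem(emails=[]))
        item['emails'] = item['emails'] + response.css('a[href^="mailto:"]::text').getall()

        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            # Exactly one follow-up request per callback, carrying the item along
            yield response.follow(next_page, callback=self.parse, meta={'item': item})
        else:
            # No more pages for this site: emit the finished item
            yield item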
The only other approach I can see is to dump the spider output to json-lines and post-process it with an external tool. But I'd prefer to fold it into the spider, preferably in a middleware or item pipeline. How can I do that?
Upvotes: 8
Views: 1568
Reputation: 383
I had the same issue, and it can be solved easily with an item pipeline:
class AggregatedPipeline:
    def open_spider(self, spider):
        # Collect every item the spider yields
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Do something with all the collected items here
        pass
In the case of huge amounts of data, the intermediate results can be persisted to a file or a database instead of being kept in memory.
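To activate the pipeline, register it in the project's settings.py (the module path below is assumed; the number only sets its order relative to other pipelines):
ITEM_PIPELINES = {
    'myproject.pipelines.AggregatedPipeline': 300,
}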
Upvotes: 0
Reputation: 473933
How about this ugly solution?
Define a dictionary (a defaultdict(list)) on a pipeline for storing per-site data. In process_item you can just append a dict(item) to the list of per-site items and raise a DropItem exception. Then, in the close_spider method, you can dump the data to wherever you want.
It should work in theory, but I'm not sure this solution is the best one.
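A minimal sketch of that idea, assuming Python 3 and a recent Scrapy; the 'url' item field used to tell sites apart is made up, and close_spider just logs instead of dumping anywhere in particular:
from collections import defaultdict
from urllib.parse import urlparse
from scrapy.exceptions import DropItem

class PerSitePipeline:
    def open_spider(self, spider):
        # One list of collected items per site hostname
        self.sites = defaultdict(list)

    def process_item(self, item, spider):
        site = urlparse(item['url']).netloc
        self.sites[site].append(dict(item))
        # Drop the per-page item; only the aggregated data is kept
        raise DropItem('aggregated into per-site summary')

    def close_spider(self, spider):
        # Dump the per-site data wherever you want
        for site, items in self.sites.items():
            spider.logger.info('%s: %d items collected', site, len(items))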
Upvotes: 5
Reputation: 6627
If you want a summary, Stats Collection would be another approach: http://doc.scrapy.org/en/0.16/topics/stats.html
For example, to get the total pages crawled for each website, increment a per-site counter from your callback:
stats.inc_value('pages_crawled:%s' % urlparse(response.url).hostname)
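As a fuller sketch (assuming a recent Scrapy, where the spider can reach the stats collector through self.crawler.stats; the spider name and URL are placeholders):
import scrapy
from urllib.parse import urlparse

class SummarySpider(scrapy.Spider):
    name = 'summary'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Count this page against the site it was crawled from
        site = urlparse(response.url).hostname
        self.crawler.stats.inc_value('pages_crawled:%s' % site)
        # ... extract data and follow links as usual ...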
Upvotes: 0