Jamey Sharp

Reputation: 8501

How can I group data scraped from multiple pages, using Scrapy, into one Item?

I'm trying to collect a few pieces of information about a bunch of different web sites. I want to produce one Item per site that summarizes the information I found across that site, regardless of which page(s) I found it on.

I feel like this should be an item pipeline, like the duplicates filter example, except I need the final contents of the Item, not the results from the first page the crawler examined.

So I tried using request.meta to pass a single partially-filled Item through the various Requests for a given site. To make that work, I had to have my parse callback return exactly one new Request per call until it had no more pages to visit, then finally return the finished Item. That's a pain if I find multiple links I want to follow, and it breaks entirely if the scheduler throws away one of the requests due to a link cycle.

The only other approach I can see is to dump the spider output to json-lines and post-process it with an external tool. But I'd prefer to fold it into the spider, preferably in a middleware or item pipeline. How can I do that?

Upvotes: 8

Views: 1568

Answers (3)

stasdavydov

Reputation: 383

I had the same issue, and nowadays it can be solved easily with an Item Pipeline:

class AggregatedPipeline:
    def open_spider(self, spider):
        # Collect items on the instance rather than in a shared class attribute.
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Do something with all the collected items here
        pass

For very large crawls, the intermediate results can be persisted to a file or database instead of being kept in memory.
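For the pipeline to run at all, it also has to be enabled in the project settings; a minimal sketch, with 'myproject' standing in for your actual project package:

# settings.py -- 'myproject.pipelines' is a placeholder module path
ITEM_PIPELINES = {
    'myproject.pipelines.AggregatedPipeline': 300,
}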

Upvotes: 0

alecxe

Reputation: 473933

How about this ugly solution?

Define a dictionary (defaultdict(list)) on a pipeline for storing per-site data. In process_item, you can just append a dict(item) to the list for that site and raise a DropItem exception. Then, in the close_spider method, you can dump the data wherever you want.

Should work in theory, but I'm not sure that this solution is the best one.
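A minimal sketch of that idea, assuming every item carries a hypothetical 'site' field identifying which website it was scraped from:

import json
from collections import defaultdict

from scrapy.exceptions import DropItem


class PerSitePipeline:
    def open_spider(self, spider):
        # One list of item dicts per site, keyed by the item's 'site' field.
        self.sites = defaultdict(list)

    def process_item(self, item, spider):
        # 'site' is a hypothetical field the spider sets on every item.
        self.sites[item['site']].append(dict(item))
        raise DropItem('aggregated into the per-site summary')

    def close_spider(self, spider):
        # Dump one summary line per site once the crawl is finished.
        with open('summary.jl', 'w') as f:
            for site, pages in self.sites.items():
                f.write(json.dumps({'site': site, 'pages': pages}) + '\n')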

Upvotes: 5

lucemia

Reputation: 6627

If you only want a summary, Stats Collection would be another approach: http://doc.scrapy.org/en/0.16/topics/stats.html

For example, to get the total number of pages crawled for each website, use something like:

stats.inc_value('pages_crawled:%s' % socket.gethostname())
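A minimal sketch of that approach inside a spider callback, with the counter keyed by the host of the crawled page rather than the local machine's hostname; self.crawler.stats is how current Scrapy versions expose the stats collector, and the spider name and start URL are placeholders:

from urllib.parse import urlparse

import scrapy


class SummarySpider(scrapy.Spider):
    name = 'summary'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Count crawled pages per website; the totals appear in the
        # stats dump Scrapy prints when the spider closes.
        host = urlparse(response.url).hostname
        self.crawler.stats.inc_value('pages_crawled:%s' % host)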

Upvotes: 0
