Maciek

Reputation: 1982

Scrapy combine items from multiple processes

I have a Scrapy script that

  1. Finds all 'pages' nodes in an XML file
  2. Parses each of those pages, collects data, and finds additional pages
  3. Parses the additional pages and collects further information

Scrapy script:

from pprint import pprint

from scrapy import Request
from scrapy.spiders import XMLFeedSpider

class test_spider(XMLFeedSpider):
    name = 'test'
    start_urls = ['https://www.example.com']
    custom_settings = {
        'ITEM_PIPELINES': {
            'test.test_pipe': 100,
        },
    }
    itertag = 'pages'

    # XMLFeedSpider calls parse_node for every node matched by itertag
    def parse_node(self, response, node):
        yield Request(
            'https://www.example.com/' + node.xpath('@id').extract_first() + '/xml-out',
            callback=self.parse2,
        )

    def parse2(self, response):
        yield {'COLLECT1': response.xpath('/@id').extract_first()}
        # `root` is an XPath prefix defined elsewhere
        page_text = response.xpath(root + '/node[@id="page"]/text()').extract_first() or ''
        for text in page_text.split('^'):
            if text:
                yield Request(
                    'https://www.example.com/' + text,
                    callback=self.parse3,
                    dont_filter=True,
                )

    def parse3(self, response):
        yield {'COLLECT2': response.xpath('/@id').extract_first()}

class test_pipe(object):
    def process_item(self, item, spider):
        pprint(item)
        return item

The ideal result would be a combined dict item such as

{'COLLECT1':'some data','COLLECT2':['some data','some data',...]}

Is there a way to invoke the pipeline after each parse_node event and get a combined dict of items?

Upvotes: 1

Views: 1041

Answers (1)

ThunderMind

Reputation: 799

In your parse2 method, use meta to pass your COLLECT1 value along to parse3. Then in parse3, retrieve COLLECT1 from meta, extract COLLECT2, and yield the combined result however you wish.
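
A minimal sketch of that approach, assuming the spider structure from the question (the URLs and XPaths are the asker's own; the undefined `root` prefix is dropped here, and only the meta plumbing is new):

from scrapy import Request
from scrapy.spiders import XMLFeedSpider

class test_spider(XMLFeedSpider):
    name = 'test'
    start_urls = ['https://www.example.com']
    itertag = 'pages'

    def parse_node(self, response, node):
        yield Request(
            'https://www.example.com/' + node.xpath('@id').extract_first() + '/xml-out',
            callback=self.parse2,
        )

    def parse2(self, response):
        collect1 = response.xpath('/@id').extract_first()
        for text in (response.xpath('/node[@id="page"]/text()').extract_first() or '').split('^'):
            if text:
                yield Request(
                    'https://www.example.com/' + text,
                    callback=self.parse3,
                    dont_filter=True,
                    meta={'COLLECT1': collect1},  # carry COLLECT1 forward to parse3
                )

    def parse3(self, response):
        # pull COLLECT1 back out of meta and combine it with this page's data
        yield {
            'COLLECT1': response.meta['COLLECT1'],
            'COLLECT2': response.xpath('/@id').extract_first(),
        }

Note that this yields one combined item per parse3 response, not a single dict with COLLECT2 as a list; merging those per-page items into one would still need extra bookkeeping (e.g. in a pipeline), since Scrapy handles each request independently.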

For more info on meta, see the Scrapy documentation on Request.meta.

Upvotes: 2
