Cipher

Reputation: 15

How to start a new request after the item_scraped scrapy signal is called?

I need to scrape the data of each item from a website using Scrapy (http://example.com/itemview). I have a list of itemIDs and I need to pass each one to a form on example.com. The URL does not change per item, so every request in my spider goes to the same URL, but the content is different for each itemID.

I don't want a for loop to handle each request, so I followed the steps below.

After the first item is scraped, the spider_closed signal gets called automatically, but I want the steps above to keep running until all the itemIDs are finished.

import scrapy
from scrapy import signals
from scrapy.http import Request
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver

from examplecrawler.items import ExamplecrawlerItem  # my project's items module


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    itemIDs = [11111, 22222, 33333]
    current_item_num = 0

    def __init__(self, itemids=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.item_scraped, signals.item_scraped)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.quit()

    def start_requests(self):
        request = self.make_requests_from_url('http://example.com/itemview')
        yield request

    def parse(self, response):
        self.driver = webdriver.PhantomJS()
        self.driver.get(response.url)
        first_data = self.driver.find_element_by_xpath('//div[@id="itemview"]').text.strip()
        yield Request(response.url, meta={'first_data': first_data},
                      callback=self.processDetails, dont_filter=True)

    def processDetails(self, response):
        itemID = self.itemIDs[self.current_item_num]
        # ...form submission with the current itemID goes here...
        # ...the content of the page is updated with the given itemID...
        yield Request(response.url, meta={'first_data': response.meta['first_data']},
                      callback=self.processData, dont_filter=True)

    def processData(self, response):
        # ...some more scraping goes here...
        item = ExamplecrawlerItem()
        item['first_data'] = response.meta['first_data']
        yield item

    def item_scraped(self, item, response, spider):
        self.current_item_num += 1
        # I need to call the processDetails function here for the next itemID,
        # and the process needs to continue till the itemIDs finish
        self.parse(response)

My pipeline:

class ExampleDBPipeline(object):
    def process_item(self, item, spider):
        MYCOLLECTION.insert(dict(item))
        return item

Upvotes: 0

Views: 702

Answers (1)

Will Madaus

Reputation: 169

I wish I had an elegant solution to this, but all I have is a hackish way of calling into the underlying classes:

self.crawler.engine.slot.scheduler.enqueue_request(scrapy.Request(url, self.yourCallBack))
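If you do go that route, it would look roughly like this inside the item_scraped handler from your spider. Note that crawler.engine.slot.scheduler is internal Scrapy API, so treat this as a sketch that may break between versions; the stop condition is my addition:

# Rough sketch only: push the next request straight into the scheduler from
# the item_scraped handler. crawler.engine.slot.scheduler is internal API.
def item_scraped(self, item, response, spider):
    self.current_item_num += 1
    if self.current_item_num < len(self.itemIDs):
        self.crawler.engine.slot.scheduler.enqueue_request(
            scrapy.Request(response.url,
                           callback=self.processDetails,
                           dont_filter=True))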

However, you can yield a request after you yield the item and have it call back to self.processDetails. Simply add this to your processData function:

yield item
self.current_item_num += 1
yield scrapy.Request(response.url, callback=self.processDetails,
                     dont_filter=True, meta={"your": "Dictionary"})
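Put together, a stripped-down spider using that pattern might look roughly like this. The PhantomJS/first_data extraction is left out and the stop condition on current_item_num is my addition; the rest of the names come from the spider in the question:

# Minimal sketch of the "yield the item, then yield the next request" pattern.
import scrapy


class ExamplecrawlerItem(scrapy.Item):  # normally lives in items.py
    first_data = scrapy.Field()


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://example.com/itemview']
    itemIDs = [11111, 22222, 33333]
    current_item_num = 0

    def parse(self, response):
        # Hand off to processDetails for the first itemID.
        # 'seed' is a placeholder; your parse() fills this from PhantomJS.
        yield scrapy.Request(response.url, callback=self.processDetails,
                             meta={'first_data': 'seed'}, dont_filter=True)

    def processDetails(self, response):
        itemID = self.itemIDs[self.current_item_num]
        # ...form submission with the current itemID goes here...
        yield scrapy.Request(response.url, callback=self.processData,
                             meta=response.meta, dont_filter=True)

    def processData(self, response):
        item = ExamplecrawlerItem()
        item['first_data'] = response.meta['first_data']
        yield item

        # Instead of relying on the item_scraped signal, schedule the next ID.
        self.current_item_num += 1
        if self.current_item_num < len(self.itemIDs):
            yield scrapy.Request(response.url, callback=self.processDetails,
                                 meta={'first_data': item['first_data']},
                                 dont_filter=True)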

Also, PhantomJS can be nice and make your life easy, but it is slower than plain requests. If possible, find the request that returns the JSON data (or whatever makes the page unparseable without JS). To do so, open Chrome, right-click, click Inspect, go to the Network tab, enter the ID into the form, and then look under the XHR or JS filters for a JSON response that contains the data or the next URL you want. Most of the time there will be some URL built by adding the ID; if you can find it, you can just concatenate your URLs and call that directly, without the cost of JS rendering. Sometimes it is randomized or not there at all, but I've had fair success with it. You can then also yield many requests at the same time without having to worry about PhantomJS trying to do two things at once, or having to initialize many instances of it. You could use tabs, but that is a pain.
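For example, if you do find such an endpoint, a spider along these lines skips PhantomJS entirely. The /itemdata?id= URL and the 'itemview' key here are made up; substitute whatever you actually find in the Network tab:

import json
import scrapy


class ExampleJsonSpider(scrapy.Spider):
    name = "example_json"
    itemIDs = [11111, 22222, 33333]

    def start_requests(self):
        for itemID in self.itemIDs:
            # Hypothetical endpoint: the ID is simply concatenated into the URL.
            url = 'http://example.com/itemdata?id=%d' % itemID
            yield scrapy.Request(url, callback=self.parse_json)

    def parse_json(self, response):
        # Parse the JSON payload directly instead of rendering the page with JS.
        data = json.loads(response.text)
        yield {'first_data': data.get('itemview')}  # 'itemview' key is assumed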

Also, I would use a Queue of your IDs to ensure thread safety. Otherwise, you could have processDetails called twice on the same ID. In the logic of your program everything seems to go linearly anyway, which means you aren't using the concurrency capabilities of Scrapy and your program will run more slowly. To use a Queue, add:

import Queue
# go inside the class definition and add
itemIDQueue = Queue.Queue()
# within __init__ add
[self.itemIDQueue.put(ID) for ID in self.itemIDs]
# within processDetails replace itemID = self.itemIDs[self.current_item_num] with
itemID = self.itemIDQueue.get()

Then there is no need to increment the counter, and your program is thread-safe.
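For illustration, processDetails inside your spider class could then look roughly like this; stopping when the queue is empty is one way to finish, adjust as needed:

# Sketch only: goes inside your spider class, with the Queue set up as above.
def processDetails(self, response):
    try:
        itemID = self.itemIDQueue.get(block=False)
    except Queue.Empty:
        return  # no IDs left, stop scheduling new requests
    # ...form submission with the current itemID goes here...
    yield scrapy.Request(response.url,
                         meta={'first_data': response.meta['first_data']},
                         callback=self.processData,
                         dont_filter=True)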

Upvotes: 0
