Cipher

Reputation: 15

How to start a new request after the item_scraped scrapy signal is called?

I need to scrape the data of each item from a website using Scrapy (http://example.com/itemview). I have a list of itemIDs and I need to pass each one to a form on example.com. The URL does not change per item, so every request in my spider goes to the same URL, but the content is different for each itemID.

I don't want a for loop to handle each request, so I followed the steps below.

After the first item is scraped, the spider_closed signal gets called automatically, but I want the steps above to keep running until all the itemIDs are finished.

import scrapy
from scrapy import signals
from scrapy.http import Request
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver

from examplecrawler.items import ExamplecrawlerItem  # my project's items module


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    itemIDs = [11111, 22222, 33333]
    current_item_num = 0

    def __init__(self, itemids=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.item_scraped, signals.item_scraped)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.quit()

    def start_requests(self):
        request = self.make_requests_from_url('http://example.com/itemview')
        yield request

    def parse(self, response):
        self.driver = webdriver.PhantomJS()
        self.driver.get(response.url)
        first_data = self.driver.find_element_by_xpath('//div[@id="itemview"]').text.strip()
        yield Request(response.url, meta={'first_data': first_data},
                      callback=self.processDetails, dont_filter=True)

    def processDetails(self, response):
        itemID = self.itemIDs[self.current_item_num]
        # ...form submission with the current itemID goes here...
        # ...the content of the page is updated with the given itemID...
        yield Request(response.url, meta={'first_data': response.meta['first_data']},
                      callback=self.processData, dont_filter=True)

    def processData(self, response):
        # ...some more scraping goes here...
        item = ExamplecrawlerItem()
        item['first_data'] = response.meta['first_data']
        yield item

    def item_scraped(self, item, response, spider):
        self.current_item_num += 1
        # I need to call the processDetails function here for the next itemID,
        # and the process needs to continue till the itemIDs finish
        self.parse(response)

My pipeline:

class ExampleDBPipeline(object):
    def process_item(self, item, spider):
        MYCOLLECTION.insert(dict(item))
        return item

Upvotes: 0

Views: 702

Answers (1)

Will Madaus

Reputation: 169

I wish I had an elegant solution to this, but all I have is a hackish way of calling into the underlying classes:

self.crawler.engine.slot.scheduler.enqueue_request(scrapy.Request(url, self.yourCallBack))
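If you do go that route, it would look roughly like this inside the item_scraped handler from your spider. Note that crawler.engine.slot.scheduler is internal Scrapy API, so treat this as a sketch that may break between versions; the stop condition is my addition:

# Rough sketch only: push the next request straight into the scheduler from
# the item_scraped handler. crawler.engine.slot.scheduler is internal API.
def item_scraped(self, item, response, spider):
    self.current_item_num += 1
    if self.current_item_num < len(self.itemIDs):
        self.crawler.engine.slot.scheduler.enqueue_request(
            scrapy.Request(response.url,
                           callback=self.processDetails,
                           dont_filter=True))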

However, you can yield a request after you yield the item and have it call back to self.processDetails. Simply add this to your processData function:

yield item
self.current_item_num += 1
yield scrapy.Request(response.url, callback=self.processDetails,
                     dont_filter=True, meta={"your": "Dictionary"})
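Put together, a stripped-down spider using that pattern might look roughly like this. The PhantomJS/first_data extraction is left out and the stop condition on current_item_num is my addition; the rest of the names come from the spider in the question:

# Minimal sketch of the "yield the item, then yield the next request" pattern.
import scrapy


class ExamplecrawlerItem(scrapy.Item):  # normally lives in items.py
    first_data = scrapy.Field()


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://example.com/itemview']
    itemIDs = [11111, 22222, 33333]
    current_item_num = 0

    def parse(self, response):
        # Hand off to processDetails for the first itemID.
        # 'seed' is a placeholder; your parse() fills this from PhantomJS.
        yield scrapy.Request(response.url, callback=self.processDetails,
                             meta={'first_data': 'seed'}, dont_filter=True)

    def processDetails(self, response):
        itemID = self.itemIDs[self.current_item_num]
        # ...form submission with the current itemID goes here...
        yield scrapy.Request(response.url, callback=self.processData,
                             meta=response.meta, dont_filter=True)

    def processData(self, response):
        item = ExamplecrawlerItem()
        item['first_data'] = response.meta['first_data']
        yield item

        # Instead of relying on the item_scraped signal, schedule the next ID.
        self.current_item_num += 1
        if self.current_item_num < len(self.itemIDs):
            yield scrapy.Request(response.url, callback=self.processDetails,
                                 meta={'first_data': item['first_data']},
                                 dont_filter=True)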

Also, PhantomJS can be nice and make your life easy, but it is slower than plain requests. If possible, find the request that returns the JSON data (or whatever makes the page unparseable without JS). To do so, open Chrome, right-click, click Inspect, go to the Network tab, enter the ID into the form, and then look under the XHR or JS filters for a JSON response that contains the data or the next URL you want. Most of the time there will be some URL built by adding the ID; if you can find it, you can just concatenate your URLs and call that directly, without the cost of JS rendering. Sometimes it is randomized or not there at all, but I've had fair success with it. You can then also yield many requests at the same time without having to worry about PhantomJS trying to do two things at once, or having to initialize many instances of it. You could use tabs, but that is a pain.
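For example, if you do find such an endpoint, a spider along these lines skips PhantomJS entirely. The /itemdata?id= URL and the 'itemview' key here are made up; substitute whatever you actually find in the Network tab:

import json
import scrapy


class ExampleJsonSpider(scrapy.Spider):
    name = "example_json"
    itemIDs = [11111, 22222, 33333]

    def start_requests(self):
        for itemID in self.itemIDs:
            # Hypothetical endpoint: the ID is simply concatenated into the URL.
            url = 'http://example.com/itemdata?id=%d' % itemID
            yield scrapy.Request(url, callback=self.parse_json)

    def parse_json(self, response):
        # Parse the JSON payload directly instead of rendering the page with JS.
        data = json.loads(response.text)
        yield {'first_data': data.get('itemview')}  # 'itemview' key is assumed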

Also, I would use a Queue of your IDs to ensure thread safety. Otherwise, you could have processDetails called twice on the same ID. In the logic of your program everything seems to go linearly anyway, which means you aren't using the concurrency capabilities of Scrapy and your program will run more slowly. To use a Queue, add:

import Queue
# go inside the class definition and add
itemIDQueue = Queue.Queue()
# within __init__ add
[self.itemIDQueue.put(ID) for ID in self.itemIDs]
# within processDetails replace itemID = self.itemIDs[self.current_item_num] with
itemID = self.itemIDQueue.get()

Then there is no need to increment the counter, and your program is thread-safe.
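For illustration, processDetails inside your spider class could then look roughly like this; stopping when the queue is empty is one way to finish, adjust as needed:

# Sketch only: goes inside your spider class, with the Queue set up as above.
def processDetails(self, response):
    try:
        itemID = self.itemIDQueue.get(block=False)
    except Queue.Empty:
        return  # no IDs left, stop scheduling new requests
    # ...form submission with the current itemID goes here...
    yield scrapy.Request(response.url,
                         meta={'first_data': response.meta['first_data']},
                         callback=self.processData,
                         dont_filter=True)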

Upvotes: 0
