Reputation: 125
My spider callback is on a parent page, and I need to follow a link and get some data from that page to add to my item. But I need to visit several pages from the parent page without creating more items. How would I go about doing that? From what I can read in the documentation, I can only go in a linear fashion:
parent page > next page > next page
But I need to:
parent page > next page
> next page
> next page
Upvotes: 4
Views: 2168
Reputation: 474003
You should return Request instances and pass the item around in meta. You would still have to do it in a linear fashion, building a chain of requests and callbacks. To achieve that, pass along a list of the remaining requests needed to complete the item, and return the item from the last callback:
import scrapy

def parse_main_page(self, response):
    item = MyItem()
    item['main_url'] = response.url

    url1 = response.xpath('//a[@class="link1"]/@href').extract()[0]
    request1 = scrapy.Request(url1, callback=self.parse_page1)

    url2 = response.xpath('//a[@class="link2"]/@href').extract()[0]
    request2 = scrapy.Request(url2, callback=self.parse_page2)

    url3 = response.xpath('//a[@class="link3"]/@href').extract()[0]
    request3 = scrapy.Request(url3, callback=self.parse_page3)

    # Attach the item and the queue of remaining requests to the first request.
    request1.meta['item'] = item
    request1.meta['requests'] = [request2, request3]
    return request1

def parse_page1(self, response):
    item = response.meta['item']
    item['data1'] = response.xpath('//div[@class="data1"]/text()').extract()[0]

    # Hand the item and the remaining queue over to the next request.
    next_request = response.meta['requests'].pop(0)
    next_request.meta['item'] = item
    next_request.meta['requests'] = response.meta['requests']
    return next_request

def parse_page2(self, response):
    item = response.meta['item']
    item['data2'] = response.xpath('//div[@class="data2"]/text()').extract()[0]

    next_request = response.meta['requests'].pop(0)
    next_request.meta['item'] = item
    next_request.meta['requests'] = response.meta['requests']
    return next_request

def parse_page3(self, response):
    # Last callback in the chain: the item is complete, so return it.
    item = response.meta['item']
    item['data3'] = response.xpath('//div[@class="data3"]/text()').extract()[0]
    return item
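As a side note, on Scrapy 1.7+ the same hand-off can be done with cb_kwargs instead of meta; here is a minimal sketch of just the first hop, reusing the names from the snippet above:

def parse_main_page(self, response):
    item = MyItem()
    item['main_url'] = response.url
    url1 = response.xpath('//a[@class="link1"]/@href').extract()[0]
    # Entries in cb_kwargs arrive as keyword arguments of the callback.
    return scrapy.Request(url1, callback=self.parse_page1,
                          cb_kwargs={'item': item})

def parse_page1(self, response, item):
    item['data1'] = response.xpath('//div[@class="data1"]/text()').extract()[0]
    return item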
Upvotes: 4
Reputation:
Using Scrapy Requests, you can perform extra operations on the next URL in the scrapy.Request's callback.
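For instance, a minimal sketch (the spider name, start URL, and selectors here are hypothetical) where the callback on the followed URL does the extra work and fills in the partially built item carried in meta:

import scrapy

class DetailSpider(scrapy.Spider):
    # Hypothetical spider; name, URL, and selectors are placeholders.
    name = "detail"
    start_urls = ["https://example.com/parent"]

    def parse(self, response):
        item = {"main_url": response.url}
        detail_url = response.xpath('//a[@class="detail"]/@href').get()
        # Follow the link; the callback gets the partially built item via meta.
        yield response.follow(detail_url, callback=self.parse_detail,
                              meta={"item": item})

    def parse_detail(self, response):
        # Extra operations on the next URL happen here, in the callback.
        item = response.meta["item"]
        item["detail"] = response.xpath('//div[@class="detail"]/text()').get()
        yield item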
Upvotes: 1