Reputation: 23
The Scrapy code below (taken from a blog post) works fine to scrape data from the first page only. I added a "Rule" to extract data from the second page, but it still takes the data from the first page only.
Any advice?
Here is the code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import TfawItem


class MasseffectSpider(CrawlSpider):
    name = "massEffect"
    allowed_domains = ["tfaw.com"]
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]
    rules = (
        Rule(LinkExtractor(allow=(),
                           restrict_xpaths=('//div[@class="small-corners-light"][1]/table/tbody/tr[1]/td[2]/a[@class="regularlink"]',)),
             callback='parse', follow=True),
    )

    def parse(self, response):
        for href in response.xpath('//a[@class="regularlinksmallbold product-profile-link"]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_detail_page)
        pass

    def parse_detail_page(self, response):
        comic = TfawItem()
        comic['title'] = response.xpath('//td/div[1]/b/span[@class="blackheader"]/text()').extract()
        comic['price'] = response.xpath('//span[@class="redheader"]/text()').extract()
        comic['upc'] = response.xpath('//td[@class="xh-highlight"]/text()').extract()
        comic['url'] = response.url
        yield comic
Upvotes: 1
Views: 991
Reputation: 21436
There are a few problems with your spider. First, you are overriding the parse()
method, which is reserved by CrawlSpider. Per the documentation:
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
The second problem is that your LinkExtractor extracts nothing: the XPath you pass to restrict_xpaths matches no links on the page.
I would recommend not using CrawlSpider at all and just going with the base scrapy.Spider, like this:
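As an aside, one common reason an XPath copied from browser dev tools matches nothing is the tbody element: browsers insert it when rendering, but it is often absent from the raw HTML that Scrapy actually downloads, and your restrict_xpaths goes through /table/tbody/tr. A minimal stdlib sketch of the pitfall (the markup here is hypothetical, not the actual tfaw.com page):

```python
import xml.etree.ElementTree as ET

# Hypothetical raw HTML as a server might send it: no <tbody> wrapper.
raw = "<table><tr><td><a class='regularlink' href='?page=2'>2</a></td></tr></table>"
table = ET.fromstring(raw)

# The dev-tools path (with tbody) finds nothing in the raw source...
print(table.findall("./tbody/tr"))  # → []
# ...while the same path without tbody matches the row containing the link.
print(len(table.findall("./tr")))   # → 1
```

If this is the cause, dropping tbody from the expression is usually enough; testing the expression in `scrapy shell` against the downloaded response is the reliable way to check.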
import scrapy


class MySpider(scrapy.Spider):
    name = 'massEffect'
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]

    def parse(self, response):
        # parse all items
        for href in response.xpath('//a[@class="regularlinksmallbold product-profile-link"]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_detail_page)
        # do next page
        next_page = response.xpath("//a[contains(text(),'next page')]/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_detail_page(self, response):
        comic = dict()
        comic['title'] = response.xpath('//td/div[1]/b/span[@class="blackheader"]/text()').extract()
        comic['price'] = response.xpath('//span[@class="redheader"]/text()').extract()
        comic['upc'] = response.xpath('//td[@class="xh-highlight"]/text()').extract()
        comic['url'] = response.url
        yield comic
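For context on the pagination step: response.urljoin() resolves the extracted hrefs (which are usually relative) against the current page's URL, following the same rules as the stdlib urllib.parse.urljoin. A quick sketch using the question's start URL (the relative href is made up for illustration):

```python
from urllib.parse import urljoin

base = 'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time'

# A root-relative href replaces the base URL's path and query entirely
# (the '/Profile/...' path here is hypothetical).
print(urljoin(base, '/Profile/Adventure-Time-1'))
# → http://www.tfaw.com/Profile/Adventure-Time-1
```

This is why yielding `scrapy.Request(response.urljoin(next_page), ...)` works no matter how the site writes its "next page" links.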
Upvotes: 1