Scrapy - Scraping data from first page only not from "Next" page in pagination

Reputation: 23

The Scrapy code below (taken from a blog post) works fine, but it scrapes data from the first page only. I added a Rule to extract data from the second page, but it still takes data from only the first page.

Any advice?

Here is the code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import TfawItem


class MasseffectSpider(CrawlSpider):
    name = "massEffect"
    allowed_domains = ["tfaw.com"]
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]

    rules = (
        Rule(LinkExtractor(allow=(),
                           restrict_xpaths=('//div[@class="small-corners-light"][1]/table/tbody/tr[1]/td[2]/a[@class="regularlink"]',)),
             callback='parse', follow=True),
    )

    def parse(self, response):
        for href in response.xpath('//a[@class="regularlinksmallbold product-profile-link"]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_detail_page)
        pass

    def parse_detail_page(self, response):
        comic = TfawItem()
        comic['title'] = response.xpath('//td/div[1]/b/span[@class="blackheader"]/text()').extract()
        comic['price'] = response.xpath('//span[@class="redheader"]/text()').extract()
        comic['upc'] = response.xpath('//td[@class="xh-highlight"]/text()').extract()
        comic['url'] = response.url
        yield comic

Upvotes: 1

Views: 991

Answers (1)

Granitosaurus

Reputation: 21436

There are a few problems with your spider. First, you are overriding the parse() method, which is reserved by CrawlSpider. Per the documentation:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
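
For reference, here is a minimal sketch of what the documentation is asking for: the rule points at a callback with any name other than parse. The spider name, the restrict_css selector, and parse_item are illustrative (inferred from the classes in your own XPaths), not taken from a working spider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = 'massEffectExample'
    allowed_domains = ['tfaw.com']
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]

    rules = (
        # The callback must not be named 'parse'; CrawlSpider uses parse()
        # internally, so overriding it silently disables the rules.
        Rule(LinkExtractor(restrict_css='a.product-profile-link'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # item extraction would go here
        yield {'url': response.url}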

The second problem is that your LinkExtractor extracts nothing: the XPath you restricted it to matches no links on the page. (A frequent culprit is tbody, which browsers insert into the DOM you inspect but which is often absent from the raw HTML that Scrapy actually downloads.)
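
You can check this for yourself in a scrapy shell session against the start URL; a quick diagnostic sketch, using the standard extract_links() method of LinkExtractor:

# Run: scrapy shell 'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time'
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_xpaths=(
    '//div[@class="small-corners-light"][1]/table/tbody'
    '/tr[1]/td[2]/a[@class="regularlink"]'))
le.extract_links(response)  # an empty list means the Rule can never fire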

I would recommend not using CrawlSpider at all and just going with the base scrapy.Spider, like this:

import scrapy


class MySpider(scrapy.Spider):
    name = 'massEffect'
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]

    def parse(self, response):
        # parse all items
        for href in response.xpath('//a[@class="regularlinksmallbold product-profile-link"]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_detail_page)
        # do next page
        next_page = response.xpath("//a[contains(text(),'next page')]/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_detail_page(self, response):
        comic = dict()
        comic['title'] = response.xpath('//td/div[1]/b/span[@class="blackheader"]/text()').extract()
        comic['price'] = response.xpath('//span[@class="redheader"]/text()').extract()
        comic['upc'] = response.xpath('//td[@class="xh-highlight"]/text()').extract()
        comic['url'] = response.url
        yield comic
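
You can try this without a full Scrapy project by saving the spider to a standalone file (the filename below is just an example) and running it with the built-in runspider command, exporting the yielded dicts as JSON:

scrapy runspider mass_effect_spider.py -o comics.json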

Upvotes: 1
