Nikhil Parmar

Reputation: 49

scrapy isn't working right in extracting the title

In this code I want to scrape the title, subtitle and the data inside each link, but I'm having issues on pages beyond 1 and 2: only 1 item gets scraped. I also want to extract only those entries whose title contains "delhivery".

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from delhivery.items import DelhiveryItem




class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2"]


    def parse(self, response):
        sites = response.xpath('//table[@width="100%"]')
        items = []

        for site in sites:
            item = DelhiveryItem()
            item['title'] = site.xpath('.//td[@class="complaint"]/a/span[@style="background-color:yellow"]/text()').extract()[0]
            #item['title'] = site.xpath('.//td[@class="complaint"]/a[text() = "%s Delivery Courier %s"]/text()').extract()[0]
            item['subtitle'] = site.xpath('.//td[@class="compl-text"]/div/b[1]/text()').extract()[0]


            item['date'] = site.xpath('.//td[@class="small"]/text()').extract()[0].strip()
            item['username'] = site.xpath('.//td[@class="small"]/a[2]/text()').extract()[0]
            item['link'] = site.xpath('.//td[@class="complaint"]/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)

            items.append(item)

    def anchor_page(self, response):
        old_item = response.request.meta['item']

        old_item['data'] = response.xpath('.//td[@style="padding-bottom:15px"]/div/text()').extract()[0]


        yield old_item
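As a side note on the link handling above: the manual `'http://' not in item['link']` check is unnecessary, because `urljoin` leaves already-absolute URLs untouched. A minimal sketch (shown with Python 3's `urllib.parse`; the question's Python 2 `urlparse` behaves the same):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://www.consumercomplaints.in/?search=delhivery&page=2"

# A relative link is resolved against the page URL...
print(urljoin(base, "/complaints/delhivery-c772900.html"))
# → http://www.consumercomplaints.in/complaints/delhivery-c772900.html

# ...while an absolute link is returned as-is, so no manual
# 'http://' check is needed before calling urljoin.
print(urljoin(base, "http://example.com/absolute.html"))
# → http://example.com/absolute.html
```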

Upvotes: 1

Views: 308

Answers (1)

Vanddel

Reputation: 1094

You need to change the item['title'] to this:

item['title'] = ''.join(site.xpath('//table[@width="100%"]//span[text() = "Delhivery"]/parent::*//text()').extract()[0])

Also edit sites as follows, to extract only the required links (the ones with Delhivery in them):

sites = response.xpath('//table//span[text()="Delhivery"]/ancestor::div')

EDIT: I understand now that you need to add a pagination rule to your code. It should look something like the following; you just need to add your imports and write the new XPaths against the item's page itself, such as the one linked in the comments below.

class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Extracting pages, allowing only links with page=number to be extracted 
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]', ), allow=('page=\d+', ),unique=True),follow=True),

         # Extract links of items on each page the spider gets from the first rule
        Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="complaint"]', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = DelhiveryItem()
        # Populate the item object here the same way you did; this function is called for each item link.
        # This means you'll be extracting data from pages like this one:
        # http://www.consumercomplaints.in/complaints/delhivery-last-mile-courier-service-poor-delivery-service-c772900.html#c1880509
        item['title'] = response.xpath('<write xpath>').extract()[0]
        item['subtitle'] = response.xpath('<write xpath>').extract()[0]
        item['date'] = response.xpath('<write xpath>').extract()[0].strip()
        item['username'] = response.xpath('<write xpath>').extract()[0]
        item['link'] = response.url
        item['data'] = response.xpath('<write xpath>').extract()[0]
        yield item
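The `allow=('page=\d+', )` argument in the first rule is an ordinary regular expression matched against candidate URLs. A quick stand-alone way to sanity-check it with plain Python `re`, outside Scrapy (the URLs below are illustrative):

```python
import re

# Same pattern as in the Rule's allow= argument.
pattern = re.compile(r"page=\d+")

urls = [
    "http://www.consumercomplaints.in/?search=delhivery&page=2",  # pagination link
    "http://www.consumercomplaints.in/?search=delhivery",         # first page, no page param
    "http://www.consumercomplaints.in/complaints/delhivery-last-mile-courier-service-poor-delivery-service-c772900.html",
]

# Only URLs matching the pattern would be followed by the pagination rule.
followed = [u for u in urls if pattern.search(u)]
print(followed)  # only the page=2 URL survives
```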

Also, when writing an XPath, I suggest you avoid styling attributes: prefer @class or @id, and fall back to @width, @style, or other styling attributes only if there is no other way.
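A quick illustration of why, using Python's stdlib `xml.etree.ElementTree` rather than Scrapy's selectors (it supports these simple attribute predicates, and the markup below is simplified): a `@class`-based predicate keeps matching after a designer tweaks the inline style, while a `@style`-based one silently breaks.

```python
import xml.etree.ElementTree as ET

# Simplified markup; the second row's inline style has been "redesigned".
html = """
<table>
  <tr><td class="complaint" style="padding-bottom:15px">kept row</td></tr>
  <tr><td class="complaint" style="padding-bottom:10px">restyled row</td></tr>
</table>
"""
root = ET.fromstring(html)

# Robust: matches both rows regardless of styling.
by_class = root.findall('.//td[@class="complaint"]')
# Fragile: only matches the row whose inline style happens to be unchanged.
by_style = root.findall('.//td[@style="padding-bottom:15px"]')

print(len(by_class), len(by_style))  # 2 1
```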

Upvotes: 1
