Echchama Nayak

Reputation: 933

normalize-space not working on scrapy

I am trying to extract chapter titles and their subtitles from the web page at the URL in start_urls. This is my spider:

import scrapy
from ..items import ContentsPageSFBItem

class BasicSpider(scrapy.Spider):
    name = "contentspage_sfb"
    #allowed_domains = ["web"]
    start_urls = [
        'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/',
    ]

    def parse(self, response):
            item = ContentsPageSFBItem()
            item['content_item'] = response.xpath('normalize-space(//ol[@class="detail-toc"]//*/text())').extract();
            length = len(response.xpath('//ol[@class="detail-toc"]//*/text()').extract()); #extract()
            full_url_list = list();
            title_list = list();
            for i in range(1,length+1):
                full_url_list.append(response.url)  
            item["full_url"] = full_url_list
            title = response.xpath('//title[1]/text()').extract();
            for j in range(1,length+1):
                title_list.append(title)  
            item["title"] = title_list
            return item

Even though I use the normalize-space function in my XPath to remove the whitespace, I get the following result in my CSV:

content_item,full_url,title
"

      ,Chapter 1,



      ,


  ,

      ,Instructor Introduction,

      ,00:01:00,



  ,

  ,

      ,Course Overview,

How do I get the result with at most one newline after each entry?

Upvotes: 1

Views: 445

Answers (1)

vold

Reputation: 1549

If you want all the text within the Table of Contents section, change the XPath expression for item['content_item'] to:

item['content_item'] = response.xpath('//ol[@class="detail-toc"]//a/text()').extract()
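For background on why the original expression did not help: in XPath 1.0, normalize-space() takes a string, and a node-set argument is converted to the string value of its first node only, so it cannot clean every text node in one call. If you still end up with whitespace-only or padded strings after extract(), one option is to clean the list in Python. This is a minimal sketch, not part of Scrapy; the helper name clean_text_nodes and the sample list are hypothetical and only mimic the shape of the output in the question.

```python
def clean_text_nodes(raw_nodes):
    """Collapse whitespace in each string and drop strings that end up empty."""
    cleaned = []
    for raw in raw_nodes:
        # str.split() with no arguments splits on any run of whitespace and
        # ignores leading/trailing whitespace, similar to normalize-space().
        text = " ".join(raw.split())
        if text:
            cleaned.append(text)
    return cleaned


# Hypothetical sample shaped like the whitespace-heavy list from the question.
sample = ["\n\n      ", "Chapter 1", "\n\n\n\n      ", "\n      ",
          "Instructor   Introduction", "00:01:00", "\n  "]
print(clean_text_nodes(sample))
```

In the spider, this could be applied to the list returned by extract() before assigning it to the item field.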

You can rewrite your spider code like this:

import scrapy

class BasicSpider(scrapy.Spider):

    name = "contentspage_sfb"
    start_urls = [
        'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/',
    ]

    def parse(self, response):
        # Each <a> inside the table of contents is one entry.
        for link in response.xpath('//ol[@class="detail-toc"]//a'):
            # Build a fresh dict per link so yielded items do not share state;
            # change dict to your scrapy item.
            item = dict()
            item['link_text'] = link.xpath('text()').extract_first()
            item['link_url'] = response.urljoin(link.xpath('@href').extract_first())
            yield item

# Output:
{'link_text': 'About This E-Book', 'link_url': 'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/pref00.html#pref00'}
{'link_text': 'Title Page', 'link_url': 'https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/title.html#title'}
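To illustrate how the absolute link_url values above come about: response.urljoin resolves an href against the page URL, and to my understanding it behaves like the standard library's urllib.parse.urljoin. The sketch below uses only the standard library. The relative href values are an assumption, since the page's actual attribute values are not shown in this post.

```python
from urllib.parse import urljoin

# Page URL taken from the spider's start_urls.
base_url = "https://www.safaribooksonline.com/library/view/shell-programming-in/9780134496696/"

# Assumed relative hrefs; the real attribute values on the page are not shown here.
hrefs = ["pref00.html#pref00", "title.html#title"]

# Resolve each relative href against the page URL.
absolute = [urljoin(base_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

An href that is already absolute passes through unchanged, so the same call is safe for both kinds of link.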

Upvotes: 1
