Usman Rafiq

Reputation: 590

Get value of text (with no tag) in scrapy

I am trying to get the text that is not wrapped in its own tag (such as <p> or <a>) from this page:

https://timesofindia.indiatimes.com/us/donald-trump-boris-johnson-talk-5g-and-trade-ahead-of-g7-white-house/articleshow/70504270.cms

So far I have used the Scrapy shell to extract the values with this code:

item = response.xpath("//div[@class='Normal']/text()").extract()

Or

item = response.css('arttextxml *::text').extract()

The problem is that these commands return values in the Scrapy shell, but when I use them in my spider file they return a null value.

Is there a solution to this problem?

Upvotes: 0

Views: 289

Answers (2)

Tony Montana

Reputation: 1018

There are multiple problems with your code.

First, it is messy. Second, the CSS selector you use to collect the links to the news articles returns the same URL more than once. Third, in your scrapy.Request call you pass self.parseNews as the callback, but that method is not defined anywhere in the file.

I have fixed your code to some extent, and with the version below I no longer see any issue.

# -*- coding: utf-8 -*-
import scrapy


class TimesofindiaSpider(scrapy.Spider):
    name = 'timesofindia'
    allowed_domains = ["timesofindia.indiatimes.com"]
    start_urls = ["https://timesofindia.indiatimes.com/World"]

    def parse(self, response):
        for item in response.css('div.top-newslist > ul > li'):
            url = item.css('a::attr(href)').extract_first()
            if url:  # skip list items that contain no link
                # urljoin resolves relative hrefs against the page URL
                yield scrapy.Request(response.urljoin(url), callback=self.parse_save)

    def parse_save(self, response):
        # Text nodes that are direct children of <div class="Normal">
        print(response.xpath("//div[@class='Normal']/text()").extract())
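
The duplicate-URL point above can be illustrated outside Scrapy. This is only a sketch using the standard library: the helper name unique_absolute_urls and the example hrefs are made up for illustration, not taken from the site. It resolves each href against the page URL and keeps the first occurrence of each result. (Scrapy's default duplicate filter normally drops repeated requests anyway, so this mainly matters if you collect the links yourself.)

```python
from urllib.parse import urljoin


def unique_absolute_urls(page_url, hrefs):
    """Resolve hrefs against page_url and drop repeats, keeping order.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # skip None or empty strings (list items without a link)
            continue
        absolute = urljoin(page_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Made-up hrefs standing in for what the CSS selector might return.
links = unique_absolute_urls(
    "https://timesofindia.indiatimes.com/World",
    ["/us/story-1.cms", None, "/us/story-1.cms", "/uk/story-2.cms"],
)
print(links)
```

Inside a spider, response.urljoin(url) does the same resolution step against the response URL.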

Upvotes: 1

Tauqeer Sajid

Reputation: 101

I wrote a simple spider for you that should give you your desired output. Please also show your code so I can point out what you are doing wrong.

Scraper

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['timesofindia.indiatimes.com']
    start_urls = ['https://timesofindia.indiatimes.com/us/donald-trump-boris-johnson-talk-5g-and-trade-ahead-of-g7-white-house/articleshow/70504270.cms']

    def parse(self, response):
        item = response.xpath('//div[@class="Normal"]/text()').extract()

        yield {'Item': item}

Upvotes: 0
