Reputation: 590
I am trying to get the text values (text that is not wrapped in a tag such as <p>, <a>, etc.) from this link.
So far I have used the Scrapy shell to get the values with this code:
item=response.xpath("//div[@class='Normal']/text()").extract()
Or
item=response.css('arttextxml *::text').extract()
The problem is that these commands return values in the Scrapy shell, but when I use them in my Scrapy spider file they return a null value.
Is there any solution for this problem?
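To make concrete what "text with no tag" means for the first selector, here is a minimal sketch that uses only Python's standard-library `html.parser` instead of Scrapy. It collects the text nodes that sit directly inside `<div class="Normal">` and skips text inside nested tags such as `<a>` or `<p>`. The class name `DirectTextParser` and the sample HTML are hypothetical, and unlike the XPath expression this sketch strips whitespace and drops empty nodes.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DirectTextParser(HTMLParser):
    """Collect text nodes that sit directly inside <div class="Normal">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside the target div; 0 means outside
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif tag == "div" and dict(attrs).get("class") == "Normal":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # depth == 1 means the text is a direct child of the target div.
        if self.depth == 1 and data.strip():
            self.texts.append(data.strip())


if __name__ == "__main__":
    # Hypothetical markup, not taken from the real article page.
    sample = '<div class="Normal">First part <a href="#">link</a> second part</div>'
    parser = DirectTextParser()
    parser.feed(sample)
    print(parser.texts)
```

The text inside `<a>` is left out because it is a child of the link, not of the div, which is the same distinction `/text()` makes in the XPath expression.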
Upvotes: 0
Views: 289
Reputation: 1018
There are multiple problems with your code.
First, it is messy. Second, the CSS selector you use to collect the links to the news articles returns the same URL more than once. Third, in the scrapy.Request call you pass self.parseNews as the callback, but no such method is defined anywhere in the file.
I have fixed your code to some extent, and it now runs without any issues for me.
# -*- coding: utf-8 -*-
import scrapy


class TimesofindiaSpider(scrapy.Spider):
    name = 'timesofindia'
    allowed_domains = ["timesofindia.indiatimes.com"]
    start_urls = ["https://timesofindia.indiatimes.com/World"]
    base_url = "https://timesofindia.indiatimes.com/"

    def parse(self, response):
        # Each <li> in the top news list holds a link to one article.
        for urls in response.css('div.top-newslist > ul > li'):
            url = urls.css('a::attr(href)').extract_first()
            if url:  # extract_first() returns None when the <li> has no link
                yield scrapy.Request(self.base_url + url, callback=self.parse_save)

    def parse_save(self, response):
        # Print the text nodes that sit directly inside <div class="Normal">.
        print(response.xpath("//div[@class='Normal']/text()").extract())
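On the duplicate-URL point above: Scrapy's scheduler already filters repeated requests by default, but if the links need to be cleaned up before requesting them, the following is a minimal sketch of one way to do it with the standard library only. The helper name `unique_absolute_urls` and the sample hrefs are hypothetical and not part of the original spider.

```python
from urllib.parse import urljoin


def unique_absolute_urls(base_url, hrefs):
    """Return absolute URLs in first-seen order, skipping missing and repeated hrefs."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # e.g. None when a list item has no <a href>
            continue
        absolute = urljoin(base_url, href)  # handles relative and absolute hrefs
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


if __name__ == "__main__":
    # Hypothetical hrefs, as a listing page might return them.
    sample = [
        "/world/us/a.cms",
        None,
        "/world/us/a.cms",
        "https://timesofindia.indiatimes.com/world/b.cms",
    ]
    for url in unique_absolute_urls("https://timesofindia.indiatimes.com/", sample):
        print(url)
```

Using `urljoin` instead of string concatenation avoids a doubled slash when the href starts with `/` and leaves hrefs that are already absolute unchanged.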
Upvotes: 1
Reputation: 101
I wrote a simple spider for you that produces your desired output. Please also post your code so I can point out what you are doing wrong.
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['timesofindia.indiatimes.com']
    start_urls = ['https://timesofindia.indiatimes.com/us/donald-trump-boris-johnson-talk-5g-and-trade-ahead-of-g7-white-house/articleshow/70504270.cms']

    def parse(self, response):
        item = response.xpath('//div[@class="Normal"]/text()').extract()
        yield {'Item': item}
Upvotes: 0