Reputation: 13
Hi, I'm fairly new to Scrapy. I'm trying to crawl articles (content, agency name, correspondent, etc.) from the following page: http://timesofindia.indiatimes.com/topic/Startup
The problem is that my spider returns the correct results for most articles, but for articles where the agency name is "reuters" (e.g. http://timesofindia.indiatimes.com/business/international-business/novartis-roche-back-french-gene-therapy-start-up-vivet/articleshow/58511702.cms), it returns only a bunch of escape characters instead of the content (it does return the headline and agency name, though). Here are my XPath expressions:
main_path = response.xpath('//div[@class="main-content"]')
yield {
    'Headline': "".join(main_path.xpath('.//h1[@class="heading1"]/text()').extract()),
    'Correspondent': "".join(main_path.xpath('.//span[@class="auth_detail"]/text()').extract()),
    'Agency': "".join(main_path.xpath('.//span[@itemprop="name"]/text()').extract()),
    'ArticleContent': (main_path.xpath('.//div[@class="Normal"]/text()').extract()),
}
Could you help me figure out why I'm running into this issue? Thanks.
Upvotes: 1
Views: 48
Reputation: 2286
Solution: insert a second / before text() in your xpath:
'ArticleContent':(main_path.xpath('.//div[@class="Normal"]//text()').extract()),
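Explanation
The Reuters articles wrap their content in additional <p> tags. /text() selects only the text nodes that are direct children of the matched element, whereas //text() also selects the text nodes inside its sub-tags (descendant nodes). On the Reuters pages, the direct text children of the div are presumably just the whitespace between the <p> tags, which would explain the escape characters you are seeing.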
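To see the difference without running the spider, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. The sample markup and the helper names (direct_text, descendant_text, clean) are invented for illustration and are not taken from the actual page; the real Reuters markup may differ.

```python
import xml.etree.ElementTree as ET

# Invented sample markup (assumption): one body with bare text,
# one where the text is wrapped in <p> tags, as described for Reuters.
PLAIN = '<div class="Normal">First sentence. Second sentence.</div>'
WRAPPED = '<div class="Normal">\n<p>First sentence.</p>\n<p>Second sentence.</p>\n</div>'


def direct_text(element):
    """Text nodes that are direct children of element, like XPath ./text()."""
    parts = [element.text] + [child.tail for child in element]
    return [part for part in parts if part]


def descendant_text(element):
    """Text nodes at any depth below element, like XPath .//text()."""
    return list(element.itertext())


def clean(fragments):
    """Join the fragments into one string, dropping whitespace-only ones."""
    return " ".join(f.strip() for f in fragments if f.strip())


plain_div = ET.fromstring(PLAIN)
wrapped_div = ET.fromstring(WRAPPED)

print(direct_text(plain_div))              # ['First sentence. Second sentence.']
print(direct_text(wrapped_div))            # ['\n', '\n', '\n']
print(descendant_text(wrapped_div))        # text of both <p> tags plus the newlines
print(clean(descendant_text(wrapped_div))) # First sentence. Second sentence.
```

In the spider itself, the two helpers correspond to .xpath('./text()') and .xpath('.//text()') on the matched div. A cleanup step like clean() may still be useful there, since //text() also returns the whitespace-only fragments.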
Upvotes: 1