Crawling a specific webpage with Scrapy

Question

Hi, I am a bit of a noob in Scrapy. I was trying to crawl articles (content, agency name, correspondent, etc.) from the following page: http://timesofindia.indiatimes.com/topic/Startup

The problem is that my spider returns the correct results for most of the articles, but for articles where the agency name is "reuters" (e.g. http://timesofindia.indiatimes.com/business/international-business/novartis-roche-back-french-gene-therapy-start-up-vivet/articleshow/58511702.cms), it returns only a bunch of escape characters instead of the content (it does return the headline and agency name, though). Here are my XPath variables:

main_path = response.xpath('//div[@class="main-content"]')
yield {
    'Headline': "".join(main_path.xpath('.//h1[@class="heading1"]/text()').extract()),
    'Correspondent': "".join(main_path.xpath('.//span[@class="auth_detail"]/text()').extract()),
    'Agency': "".join(main_path.xpath('.//span[@itemprop="name"]/text()').extract()),
    'ArticleContent': (main_path.xpath('.//div[@class="Normal"]/text()').extract()),
}

Could you help me figure out why I am facing this issue? Thanks.

Done Data Solutions · Accepted Answer

Solution: insert a second / before text() in your XPath:

'ArticleContent':(main_path.xpath('.//div[@class="Normal"]//text()').extract()),
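With //text(), .extract() returns a list of text fragments, and some of them may be whitespace-only strings such as "\n". Below is a minimal sketch of one way to merge such a list into a single clean string. The name join_fragments is a hypothetical helper, not part of Scrapy, and the sample list is made up for illustration.

```python
import re


def join_fragments(fragments):
    """Join extracted text fragments into one string with normalized whitespace."""
    # Collapse runs of whitespace (spaces, \n, \t, \r) inside each fragment.
    cleaned = (re.sub(r"\s+", " ", fragment).strip() for fragment in fragments)
    # Drop fragments that were whitespace-only and join the rest with single spaces.
    return " ".join(fragment for fragment in cleaned if fragment)


# Made-up sample of what .extract() might return for a nested article body.
fragments = ["\n", "First paragraph.", "\n\t", "Second  paragraph.", "\n"]
print(join_fragments(fragments))
```

In the spider this could be used as 'ArticleContent': join_fragments(main_path.xpath('.//div[@class="Normal"]//text()').extract()), if a single string is preferred over a list.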

Explanation

Reuters articles have additional nested tags in their article content. While .../text() captures only the text directly inside the matched node/tag, ...//text() also captures the text inside its sub-tags / sub-nodes.
