zwl1619

Reputation: 4232

About XPath when using Scrapy

XPath expressions:

item['title'] = response.xpath('//span[@class="title"]/text()').extract_first()
item['content'] = response.xpath('//div[@class="content"]').extract_first()

Results:

{
'title': '\t史蒂芬霍金',
'content': '<div class="content"><div>能够在过去这么多年的时间里研究并学习宇宙学<br>\r\n对我来说意义非凡</div></div>'
}

Questions:

1. How can I remove the leading \t in the title field?
2. How can I remove the outer <div class="content"></div> wrapper from the content field? (The child nodes must not be removed.)

Upvotes: 0

Views: 52

Answers (2)

宏杰李

Reputation: 12168

item['content'] = response.xpath('string(//div[@class="content"])').extract_first()

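string() concatenates all the text nodes under the matched node and returns them as one string, so the markup itself is dropped.

If you want to get rid of whitespace, use normalize-space(). It is built on top of string() and works like Python's strip(), except that it also collapses each internal run of whitespace into a single space: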

item['content'] = response.xpath('normalize-space(//div[@class="content"])').extract_first()
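
To make the difference from strip() concrete, here is a rough pure-Python emulation of what normalize-space() does to a string that has already been extracted. The helper name xpath_normalize_space and the sample string are made up for illustration; the only assumption taken from XPath 1.0 is that whitespace means space, tab, carriage return and line feed.

```python
import re


def xpath_normalize_space(text):
    """Approximate XPath's normalize-space() on a plain Python string.

    Hypothetical helper for illustration only, not part of Scrapy.
    """
    # Collapse every run of XPath whitespace into a single space ...
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    # ... then drop the single space that may remain at either end.
    return collapsed.strip(" ")


raw = "\tline one\r\n  line two "

stripped = raw.strip()                   # only the ends are trimmed
normalized = xpath_normalize_space(raw)  # ends trimmed, inner runs collapsed

print(repr(stripped))
print(repr(normalized))
```

So strip() leaves the \r\n inside the text alone, while normalize-space() turns it into a single space.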

Upvotes: 1

paul trmbrth

Reputation: 20748

You can use Python's strip() for title:

item['title'] = response.xpath(
                    '//span[@class="title"]/text()').extract_first().strip()
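
One caveat: extract_first() returns None when the XPath matches nothing, and calling .strip() on None raises AttributeError. A small guard avoids that. The helper name clean_text below is hypothetical, just a sketch of the idea:

```python
def clean_text(value, default=""):
    """Strip surrounding whitespace, tolerating a missing (None) value.

    Hypothetical helper for illustration; not a Scrapy API.
    """
    if value is None:
        return default
    return value.strip()


# The title from the question, with its leading tab.
print(clean_text("\t史蒂芬霍金"))
# What extract_first() would hand back if the span were missing.
print(repr(clean_text(None)))
```

In the spider this would be used as clean_text(response.xpath('//span[@class="title"]/text()').extract_first()). Passing a default to extract_first(), as in extract_first(default=''), should achieve the same thing without a helper.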

And you can chain your selector with XPath's string() or normalize-space() for content:

item['content'] = response.xpath(
                      '//div[@class="content"]').xpath('string(.)').extract_first()

Upvotes: 1
