Reputation: 4232
XPath expressions:
item['title'] = response.xpath('//span[@class="title"]/text()').extract_first()
item['content'] = response.xpath('//div[@class="content"]').extract_first()
Results:
{
'title': '\t史蒂芬霍金',
'content': '<div class="content"><div>能够在过去这么多年的时间里研究并学习宇宙学<br>\r\n对我来说意义非凡</div></div>'
}
Questions:
1. How do I remove the \t from the title field?
2. How do I remove the outer <div class="content"></div> from the content field? (The child nodes must not be removed.)
Upvotes: 0
Views: 52
Reputation: 12168
item['content'] = response.xpath('string(//div[@class="content"])').extract_first()
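string() returns the string value of the node, that is, the concatenation of all its descendant text nodes, so the markup itself is dropped.
If you also want to get rid of whitespace, you can use normalize-space(). It is built on top of string() and, like Python's strip(), trims leading and trailing whitespace; in addition, it collapses each internal run of whitespace into a single space: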
item['content'] = response.xpath('normalize-space(//div[@class="content"])').extract_first()
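To make the difference between string(), strip() and normalize-space() concrete, here is a rough pure-Python sketch of their behaviour. It is not Scrapy code: the helper names string_value and normalize_space are hypothetical, and the sample markup is a well-formed stand-in for the question's HTML (<br/> instead of <br>, placeholder English text) so that the standard-library XML parser accepts it.

```python
import re
import xml.etree.ElementTree as ET

# Well-formed stand-in for the question's markup (<br/> instead of <br>,
# placeholder English text) so the stdlib XML parser accepts it.
SAMPLE = '<div class="content"><div>  first line<br/>\n   second line </div></div>'


def string_value(element):
    """Approximate XPath string(): join every descendant text node."""
    return "".join(element.itertext())


def normalize_space(text):
    """Approximate XPath normalize-space(): collapse whitespace runs, then trim."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


root = ET.fromstring(SAMPLE)
raw = string_value(root)

print(repr(raw))                   # tags gone, whitespace untouched
print(repr(raw.strip()))           # only the two ends are trimmed
print(repr(normalize_space(raw)))  # ends trimmed, inner runs collapsed
```

Note that both functions return plain text, so the child elements are not preserved as markup.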
Upvotes: 1
Reputation: 20748
You can use Python's strip() for title:
item['title'] = response.xpath(
'//span[@class="title"]/text()').extract_first().strip()
And you can chain your selector with XPath's string() or normalize-space() for content:
item['content'] = response.xpath(
'//div[@class="content"]').xpath('string(.)').extract_first()
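One caveat with the title snippet: extract_first() returns None when nothing matches, so calling .strip() directly on its result would raise AttributeError on such a page. A minimal sketch of a guard, assuming a small helper (strip_or_none is a hypothetical name, not part of Scrapy):

```python
def strip_or_none(value):
    """Strip surrounding whitespace, passing None through unchanged."""
    return value.strip() if value is not None else None


# Intended use inside a spider callback, where `response` comes from Scrapy:
# item['title'] = strip_or_none(
#     response.xpath('//span[@class="title"]/text()').extract_first())

print(repr(strip_or_none('\t史蒂芬霍金')))  # '史蒂芬霍金'
print(repr(strip_or_none(None)))           # None
```

If an empty string is an acceptable fallback, passing default='' to extract_first() should achieve much the same without a helper.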
Upvotes: 1