Reputation: 549
For the following code:
<a class="title" href="the link">
Low price
<strong>computer</strong>
you should not miss
</a>
I used this XPath expression in Scrapy:
response.xpath('.//a[@class="title"]//text()[normalize-space()]').extract()
I got the following result:
u'\n \n Low price ', u'computer', u' you should not miss'
Why were the two \n characters and the many spaces before "Low price" not removed by normalize-space() in this example?
Another question: how can I combine the 3 parts into one scraped item, u'Low price computer you should not miss'?
Upvotes: 6
Views: 2994
Reputation: 504
I had the same problem. Try this:
[item.strip() for item in response.xpath('.//a[@class="title"]//text()').extract()]
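As a rough sketch of how the stripped parts could then be combined into one string: the list below is a hard-coded stand-in for what extract() returned in the question, and the names raw_parts and title are only illustrative.

```python
# Stand-in for the list returned by
# response.xpath('.//a[@class="title"]//text()').extract()
raw_parts = [u'\n \n Low price ', u'computer', u' you should not miss']

# Strip each text node, skip any that end up empty, and join with one space.
parts = [item.strip() for item in raw_parts]
title = u' '.join(part for part in parts if part)

print(title)
```

Note that strip() only trims the ends of each part; runs of whitespace inside a single text node would be left as they are.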
Upvotes: 2
Reputation: 163418
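Your call to normalize-space() is in a predicate. That means you are selecting text nodes where (the effective boolean value of) normalize-space() is true, i.e. text nodes that are not empty or whitespace-only. The selected nodes are returned unchanged, leading whitespace included. You aren't selecting the result of normalize-space(): for that you would want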
.//a[@class="title"]//text()/normalize-space()
(which needs XPath 2.0; Scrapy's selectors are built on lxml, which implements only XPath 1.0).
The second part of your question: just use
string(.//a[@class="title"])
(assuming Scrapy allows you to use an XPath expression that returns a string, rather than one that returns nodes).
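To make the predicate-versus-function distinction concrete, here is a pure-Python imitation, not real XPath evaluation. The helper normalize_space and the extra whitespace-only node at the end of the list are assumptions added for illustration.

```python
import re


def normalize_space(text):
    """Imitate XPath normalize-space(): collapse runs of space, tab, CR and LF
    into one space and trim both ends."""
    return re.sub(r'[ \t\r\n]+', ' ', text).strip(' ')


# Text nodes from the question, plus a hypothetical whitespace-only node.
text_nodes = [u'\n \n Low price ', u'computer', u' you should not miss', u'\n  ']

# Predicate form, text()[normalize-space()]: keep the nodes whose normalized
# value is non-empty, but return the nodes themselves, untouched.
filtered = [node for node in text_nodes if normalize_space(node)]

# Function form, text()/normalize-space() (XPath 2.0): return normalized values.
normalized = [normalize_space(node) for node in text_nodes]

# string(a) concatenates every descendant text node exactly as it is.
string_value = u''.join(text_nodes)

print(filtered)
print(normalized)
print(repr(string_value))
```

The predicate only filters, which is why the leading newlines survived in the question's output.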
Upvotes: 0
Reputation: 14231
Please try this:
'normalize-space(.//a[@class="title"])'
Upvotes: 7