LearnAWK

Reputation: 549

XPath: why doesn't normalize-space() remove the whitespace and \n?

Given the following HTML:

<a class="title" href="the link">
Low price
<strong>computer</strong>
you should not miss
</a>

I used this XPath expression in Scrapy to extract the text:

response.xpath('.//a[@class="title"]//text()[normalize-space()]').extract()

I got the following result:

u'\n                  \n                  Low price ', u'computer', u' you should not miss'

Why were the two \n characters and the many spaces before "Low price" not removed by normalize-space() in this example?

Another question: how can I combine the three parts into a single scraped item, u'Low price computer you should not miss'?

Upvotes: 6

Views: 2994

Answers (3)

user_1330

Reputation: 504

I had the same problem. Try this:

[item.strip() for item in response.xpath('.//a[@class="title"]//text()').extract()]
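To also get the single combined string asked for in the question, the stripped pieces can be joined afterwards. The following is a minimal sketch in plain Python, assuming the extracted list looks like the output reported in the question; the variable names `parts`, `cleaned`, and `combined` are illustrative only.

```python
# Text nodes as reported in the question's output (assumed input).
parts = [
    u'\n                  \n                  Low price ',
    u'computer',
    u' you should not miss',
]

# Strip each node and drop any node that is empty after stripping.
cleaned = [p.strip() for p in parts if p.strip()]

# Join the remaining pieces with single spaces.
combined = u' '.join(cleaned)

print(cleaned)
print(combined)
```

Note that joining with a space inserts one between every pair of text nodes, even where the original markup had none (for example inside a word split by an inline tag), so this may not suit every page.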

Upvotes: 2

Michael Kay

Reputation: 163418

Your call to normalize-space() is in a predicate. That means you are selecting text nodes where (the effective boolean value of) normalize-space() is true. You aren't selecting the result of normalize-space: for that you would want

.//a[@class="title"]//text()/normalize-space()

(which needs XPath 2.0)
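The difference between a predicate and a function call can be mimicked in plain Python, as a rough illustration rather than a real XPath engine. Here `normalize_space` is a hypothetical stand-in that trims and collapses whitespace (Python's `split()` treats slightly more characters as whitespace than XPath does), and `text_nodes` is made-up sample data.

```python
def normalize_space(s):
    """Rough stand-in for XPath normalize-space(): trim and collapse whitespace."""
    return ' '.join(s.split())

# Hypothetical text nodes, including one that is whitespace only.
text_nodes = ['\n   \n   Low price ', 'computer', '   \n ', ' you should not miss']

# text()[normalize-space()]: the predicate only filters.
# Nodes whose normalized value is non-empty are kept, unchanged.
as_predicate = [t for t in text_nodes if normalize_space(t)]

# text()/normalize-space() (XPath 2.0): the function is applied to each node,
# so every node is replaced by its normalized value.
as_function = [normalize_space(t) for t in text_nodes]

print(as_predicate)
print(as_function)
```

In the predicate form the whitespace-only node is dropped but the leading newlines on the first node survive, which matches the output seen in the question.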

The second part of your question: just use

string(.//a[@class="title"])

(assuming scrapy-spider allows you to use an XPath expression that returns a string, rather than one that returns nodes).

Upvotes: 0

Alexander Petrov

Reputation: 14231

Please try this:

'normalize-space(.//a[@class="title"])'
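In Scrapy this expression would presumably be evaluated with something like response.xpath(...).extract_first(), since it returns a string rather than nodes. What it computes can be sketched with the standard library alone: take the element's string value (all descendant text concatenated) and normalize the whitespace. This is an approximation using `xml.etree.ElementTree`, not Scrapy itself, and the snippet is the markup from the question.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question.
snippet = '''<a class="title" href="the link">
Low price
<strong>computer</strong>
you should not miss
</a>'''

a = ET.fromstring(snippet)

# String value of the element: every descendant text node, concatenated.
string_value = ''.join(a.itertext())

# Approximate normalize-space(): trim the ends and collapse whitespace runs.
title = ' '.join(string_value.split())

print(title)
```

Because the function is applied to the whole element rather than to each text node, the three parts come back as one string.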

Upvotes: 7
