Reputation: 12538

XPath getting text from all elements that match XPath query

I am having a lot of difficulty constructing a query that will return all the text from all the elements below in one string (assume all other elements on the page contain text as well and are not span or div elements).

Note: Because I am using the PHP XPath engine, I am forced to use a solution that is XPath 1.0.

HTML

<div>Hello</div>
<div>World</div>
<div>!!!</div>
<span>This</span>
<span>is</span>
<span>cool</span>

XPath

normalize-space(//*/div | //*/span)

Desired output:

Hello World!!! This is cool

I appreciate any suggestions. Many thanks in advance!

Upvotes: 1

Answers (4)

hakre

Reputation: 197767

The XPath 1.0 normalize-space() function operates on a string, not on a node-set. In your example code you pass a node-set as its first parameter:

 normalize-space(//*/div | //*/span)

In such a case, the node-set is converted to a string by taking the string-value of the node that comes first in document order. So your expression only returns the text of the first div, which is not what you need.
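
The first-node behaviour is easy to check. A minimal sketch using Python's lxml (an XPath 1.0 engine, used here instead of PHP purely for illustration), assuming the fragment from the question is wrapped in a made-up <root> element so that it parses as XML:

```python
from lxml import etree

# Assumption: the question's fragment wrapped in a hypothetical <root> element.
doc = """<root>
<div>Hello</div>
<div>World</div>
<div>!!!</div>
<span>This</span>
<span>is</span>
<span>cool</span>
</root>"""
root = etree.fromstring(doc)

# The union from the question selects a node-set of six elements ...
nodes = root.xpath("//*/div | //*/span")

# ... but converting that node-set to a string only uses its first node.
first_only = root.xpath("normalize-space(//*/div | //*/span)")

print(len(nodes), repr(first_only))
```

Only the text of the first div comes back, even though the node-set itself contains all six elements.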

To the best of my knowledge, it is not possible to achieve what you're looking for with a single XPath 1.0 query alone. It is possible with the help of PHP, however: register a PHP function with the XPath processor and let that function build the string you're looking for.
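
The general idea of selecting the elements and then building the string in the host language can be sketched as follows. This is not the PHP approach itself but an illustration in Python's standard-library ElementTree; the <root> wrapper and the helper name joined_text are assumptions made for the example, and it assumes the matched elements are not nested inside each other.

```python
import xml.etree.ElementTree as ET

# Assumption: the question's fragment wrapped in a hypothetical <root> element.
doc = """<root>
<div>Hello</div>
<div>World</div>
<div>!!!</div>
<span>This</span>
<span>is</span>
<span>cool</span>
</root>"""


def joined_text(root, tags=("div", "span")):
    """Join the text of all matching elements, in document order, with single spaces."""
    parts = []
    for el in root.iter():  # iter() walks the tree in document order
        if el.tag in tags:
            # itertext() yields the element's own and descendant text;
            # split() + join collapses runs of whitespace.
            text = " ".join("".join(el.itertext()).split())
            if text:
                parts.append(text)
    return " ".join(parts)


root = ET.fromstring(doc)
result = joined_text(root)
print(result)
```

Note that this puts a space before "!!!"; the desired output in the question omits it, which would need an extra rule on top of the plain join.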

Upvotes: 1

Shaun McCance

Reputation: 474

You already have whitespace between the elements, so there's no need to add any, as long as you include it in what you select. The string-value of an element is the concatenation of all its descendant text nodes, in document order (and when a node-set is passed to something that expects a string, XPath 1.0 uses the string-value of its first node). So if the context node is the parent of all these div and span elements, the simplest expression is just

normalize-space(.)
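
A quick check of this expression, sketched with Python's lxml and assuming a hypothetical <root> element as the parent (and context node) of the div and span elements:

```python
from lxml import etree

# Assumption: a made-up <root> element is the parent of the question's elements.
doc = """<root>
<div>Hello</div>
<div>World</div>
<div>!!!</div>
<span>This</span>
<span>is</span>
<span>cool</span>
</root>"""
root = etree.fromstring(doc)

# "." is the context node (root here); its string-value includes the
# newlines between the child elements, which normalize-space() collapses.
result = root.xpath("normalize-space(.)")
print(repr(result))
```

This relies on the whitespace between the elements being present in the parsed document. It also picks up any other text under the context node, not just the text of the div and span elements.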

Upvotes: 1

paul trmbrth

Reputation: 20748

Using the EXSLT string extensions with lxml (Python), see http://www.exslt.org/str/str.html:

str:replace(str:concat(//text()), "\n", " ")

or even simpler

normalize-space(str:concat(//text()))

Tested in a Python shell:

>>> import lxml.etree
>>> import lxml.html
>>> doc="""<div>Hello</div>
... <div>World</div>
... <div>!!!</div>
... <span>This</span>
... <span>is</span>
... <span>cool</span>"""
>>> root = lxml.etree.fromstring(doc, parser=lxml.html.HTMLParser())
>>> root.xpath('str:replace(str:concat(//text()), "\n", " ")', namespaces={"str": "http://exslt.org/strings"})
'Hello World !!! This is cool'
>>> root.xpath('normalize-space(str:concat(//text()))', namespaces={"str": "http://exslt.org/strings"})
'Hello World !!! This is cool'
>>>

Upvotes: 0

alecxe

Reputation: 473873

This works in XPath 2.0:

string-join(/*/text(), ' ')

Tested here, prints:

Hello World !!! This is cool

Upvotes: 0
