Reputation: 12538
I am having a lot of difficulty constructing a query that returns all the text from all the elements below as one string (assume all other elements on the page also contain text and are not span or div elements).
Note: Because I am using the PHP XPath engine, I am forced to use a solution that is XPath 1.0.
HTML
<div>Hello</div>
<div>World</div>
<div>!!!</div>
<span>This</span>
<span>is</span>
<span>cool</span>
XPath
normalize-space(//*/div | //*/span)
Desired output:
Hello World!!! This is cool
I appreciate any suggestions. Many thanks in advance!
Upvotes: 1
Views: 598
Reputation: 197767
The XPath 1.0 normalize-space() function works on a string, not on a node-set. In your example code you pass a node-set as its first parameter:
normalize-space(//*/div | //*/span)
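In such a case the node-set is converted to a string first, and the string-value of a node-set is the string-value of its first node in document order. Every node after the first is ignored, so the expression does not do what you need.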
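The first-node behaviour is easy to check. The following is a minimal sketch, assuming Python with lxml (which evaluates XPath 1.0 through libxml2, the same library that backs PHP's DOM extension) and the HTML from the question; the variable names are illustrative only.

```python
import lxml.etree
import lxml.html

doc = """<div>Hello</div>
<div>World</div>
<div>!!!</div>
<span>This</span>
<span>is</span>
<span>cool</span>"""

root = lxml.etree.fromstring(doc, parser=lxml.html.HTMLParser())

# The union itself selects all six elements...
nodes = root.xpath("//*/div | //*/span")

# ...but passing the node-set to a string function keeps only the first node.
first_only = root.xpath("normalize-space(//*/div | //*/span)")

print(len(nodes))   # 6
print(first_only)   # Hello
```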
To the best of my knowledge, it is not possible to achieve this with a single XPath 1.0 query alone. It is possible with the help of PHP, however: register a PHP function that builds the string you want and call it from the XPath expression.
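The general idea (select the nodes with XPath, then build the string in the host language) can be sketched as follows. This is only an illustration in Python with lxml, not the registerPhpFunctions() approach itself; in PHP the same loop would run over the result of DOMXPath::query(). The names parts and joined are made up for this example.

```python
import lxml.etree
import lxml.html

doc = """<div>Hello</div>
<div>World</div>
<div>!!!</div>
<span>This</span>
<span>is</span>
<span>cool</span>"""

root = lxml.etree.fromstring(doc, parser=lxml.html.HTMLParser())

# Evaluate normalize-space() once per selected element, so that each call
# receives a single node instead of a six-node set.
parts = [
    node.xpath("normalize-space(.)")
    for node in root.xpath("//*/div | //*/span")
]

# Join the per-element strings in host code.
joined = " ".join(parts)
print(joined)   # Hello World !!! This is cool
```

Note that this puts a space before "!!!", which differs slightly from the desired output in the question.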
See also:
DOMXPath::registerPhpFunctions(): Register PHP functions as XPath functions
Upvotes: 1
Reputation: 474
You already have whitespace between the elements, so there's no need to add any, as long as you include it in what you select. If you pass a node-set to something that expects a string, XPath uses the string-value of the first node in the set, and the string-value of an element is the concatenation of all its descendant text nodes, in document order. So if the context node is the parent of all these div and span elements, the simplest expression is just
normalize-space(.)
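A minimal sketch of this, assuming Python with lxml and assuming that the HTML parser wraps the fragment in html and body elements, so that body is the common parent and serves as the context node:

```python
import lxml.etree
import lxml.html

doc = """<div>Hello</div>
<div>World</div>
<div>!!!</div>
<span>This</span>
<span>is</span>
<span>cool</span>"""

root = lxml.etree.fromstring(doc, parser=lxml.html.HTMLParser())

# body is the parent of all the div and span elements in this parsed tree.
body = root.find("body")

# The string-value of body is the concatenation of all descendant text nodes,
# including the newlines between elements; normalize-space() then collapses
# each run of whitespace to a single space.
result = body.xpath("normalize-space(.)")
print(result)   # Hello World !!! This is cool
```

On a real page the parent may contain other children as well, and their text would be included in the result.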
Upvotes: 1
Reputation: 20748
Using EXSLT string extensions (which are not part of core XPath 1.0) with lxml (Python): http://www.exslt.org/str/str.html
str:replace(str:concat(//text()), "\n", " ")
or even simpler
normalize-space(str:concat(//text()))
Tested in a Python shell:
>>> import lxml.etree
>>> import lxml.html
>>> doc="""<div>Hello</div>
... <div>World</div>
... <div>!!!</div>
... <span>This</span>
... <span>is</span>
... <span>cool</span>"""
>>> root = lxml.etree.fromstring(doc, parser=lxml.html.HTMLParser())
>>> root.xpath('str:replace(str:concat(//text()), "\n", " ")', namespaces={"str": "http://exslt.org/strings"})
'Hello World !!! This is cool'
>>> root.xpath('normalize-space(str:concat(//text()))', namespaces={"str": "http://exslt.org/strings"})
'Hello World !!! This is cool'
Upvotes: 0