Does anyone know how to maintain text formatting when using XPath to extract data?
I am currently extracting all blocks
<div class="info">
<h5>title</h5>
text <a href="somelink">anchor</a>
</div>
from a page. The problem is that when I access the nodeValue, I only get plain text. How can I capture the contents including the formatting, i.e. with the h5 and a tags still in the markup?
Thanks in advance. I have searched every combination imaginable on Google with no luck.
Upvotes: 1
Views: 2306
Reputation: 243459
The XPath language is designed to be embedded in a host language (such as the DOM API, XSLT, XQuery, ...) and cannot be used standalone. The original question does not specify which host language is wanted.
Below is a very simple and short solution when XPath is embedded in XSLT.
This transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes"/>
<xsl:template match="div[@class='info']">
<xsl:copy-of select="."/>
</xsl:template>
</xsl:stylesheet>
when applied to this XML document:
<html>
<body>
<div class="info">
<h1>title1</h1> text1
<a href="somelink1">anchor1</a>
</div>
Something else here
<div class="info">
<h2>title2</h2> text2
<a href="somelink2">anchor2</a>
</div>
Something else here
<div class="info">
<h3>title3</h3> text3
<a href="somelink3">anchor3</a>
</div>
</body>
</html>
produces the wanted result:
<div class="info">
<h1>title1</h1> text1
<a href="somelink1">anchor1</a>
</div>
Something else here
<div class="info">
<h2>title2</h2> text2
<a href="somelink2">anchor2</a>
</div>
Something else here
<div class="info">
<h3>title3</h3> text3
<a href="somelink3">anchor3</a>
</div>
Upvotes: 1
Reputation: 7594
I would like to add to Ciaran McNulty's answer.
You can do the same in SimpleXML:
$simplexml->node->asXml(); // saveXml() is now an alias
And to expand on the quote
The NodeValue of an element is really the textual value, not the structured XML.
You can think of your node as follows:
<div class="info">
<__toString()> </__toString()>
<h5>title</h5>
<__toString()> text </__toString()>
<a href="somelink">anchor</a>
<__toString()> </__toString()>
</div>
Calling $element->nodeValue is like calling $element->__toString(): it returns only text, never the element tags around it. The imaginary __toString() element I made up above is officially an XML_TEXT_NODE.
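To make the text-node picture concrete, here is a minimal sketch using PHP's DOM extension. The sample markup is the question's snippet written on one line, and the variable names are only illustrative. It lists the direct children of the div and shows that nodeValue carries no tags (for a DOMElement it also includes the text nested inside the h5 and a children).

```php
<?php
// Sample markup taken from the question, written on one line (an assumption made
// so that no whitespace-only text nodes appear between the tags).
$dom = new DOMDocument();
$dom->loadXML('<div class="info"><h5>title</h5> text <a href="somelink">anchor</a></div>');
$div = $dom->documentElement;

// Classify each direct child as a text node or an element node.
$kinds = [];
foreach ($div->childNodes as $child) {
    $kinds[] = $child->nodeType === XML_TEXT_NODE
        ? 'text: "' . $child->nodeValue . '"'
        : 'element: <' . $child->nodeName . '>';
}
print_r($kinds);

// nodeValue of the div: all descendant text, with the tags dropped.
echo $div->nodeValue, "\n";
```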
Upvotes: 1
Reputation: 287755
div/node() should do the trick.
Example input:
<div class="info">
some <h5>title</h5> text <a href="somelink">anchor</a> more text
</div>
Example XSLT stylesheet:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<newtag>
<xsl:copy-of select="div/node()"/>
</newtag>
</xsl:template>
</xsl:stylesheet>
Example output:
<?xml version="1.0" encoding="utf-8"?>
<newtag> some <h5>title</h5> text <a href="somelink">anchor</a> more text</newtag>
Upvotes: 0
Reputation: 12320
You'll need to make sure your XPath query 'ends' at the <div class="info"> element. However, because of the way XPath works, you'll still get all of the child tags as separate nodes. You'll just need to concatenate them.
You could also try a join function such as XPath 2.0's string-join(), though, as I haven't used it, I can't say what problems you might run into.
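One way the concatenation might look, assuming PHP's DOM extension as the host for XPath (the question does not say which host is used) and a single matching div: select the child nodes with node(), serialize each one, and append the pieces. The sample markup and variable names are illustrative only.

```php
<?php
// Illustrative input: one info block with mixed text and element children.
$dom = new DOMDocument();
$dom->loadXML('<div class="info">some <h5>title</h5> text <a href="somelink">anchor</a> more text</div>');

$xpath = new DOMXPath($dom);

// node() selects every child (text nodes and elements) as a separate node.
// With several matching divs this query would return the children of all of them.
$inner = '';
foreach ($xpath->query('//div[@class="info"]/node()') as $child) {
    // saveXML($node) serializes just that node, keeping any markup.
    $inner .= $dom->saveXML($child);
}

echo $inner, "\n";
```

The result is the inner markup of the div without the div tags themselves.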
Upvotes: 0
Reputation: 18848
If you have it as a DOMElement $element that is part of a DOMDocument $dom, then you will want to do something like:
$string = $dom->saveXml($element);
The NodeValue of an element is really the textual value, not the structured XML.
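A fuller sketch of this approach, under a few assumptions: the input is well-formed XML (real-world HTML would more likely go through loadHTML()), the sample markup is modeled on the question, and the variable names are illustrative.

```php
<?php
// Hypothetical sample page, modeled on the snippet in the question.
$page = '<html><body>'
      . '<div class="info"><h5>title</h5> text <a href="somelink">anchor</a></div>'
      . '<p>Something else here</p>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadXML($page);

$xpath = new DOMXPath($dom);

$blocks = [];
foreach ($xpath->query('//div[@class="info"]') as $element) {
    // nodeValue: text content only, tags dropped.
    $plain = $element->nodeValue;
    // saveXML($element): the element serialized with its markup intact.
    $blocks[] = $dom->saveXML($element);
}

print_r($blocks);
```

Each entry in $blocks is the complete div, including the h5 and a tags, while $plain holds only the flattened text.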
Upvotes: 2