Screen Scraping with PHP and XPath

Does anyone know how to maintain text formatting when using XPath to extract data?

I am currently extracting all blocks

<div class="info"> <h5>title</h5> text <a href="somelink">anchor</a> </div>

from a page. The problem is that when I access nodeValue, I only get plain text. How can I capture the contents with the markup intact, i.e. with the h5 and a tags still in the output?

Thanks in advance. I have searched every combination imaginable on Google with no luck.

Upvotes: 1

Answers (5)

Dimitre Novatchev

Reputation: 243459

The XPath language is designed to be embedded in a host language or API (such as the DOM API, XSLT, XQuery, ...) and cannot be used standalone. The original question does not specify the desired embedding.

Below is a very simple and short solution when XPath is embedded in XSLT.

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes"/>

    <xsl:template match="div[@class='info']">
       <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>

when applied to this XML document:

<html>
    <body>
        <div class="info">
            <h1>title1</h1> text1
            <a href="somelink1">anchor1</a>
        </div>
        Something else here
        <div class="info">
            <h2>title2</h2> text2
            <a href="somelink2">anchor2</a>
        </div>
        Something else here
        <div class="info">
            <h3>title3</h3> text3
            <a href="somelink3">anchor3</a>
        </div>
    </body>
</html>

produces the wanted result:

<div class="info">
  <h1>title1</h1> text1
    <a href="somelink1">anchor1</a>
</div>
        Something else here
<div class="info">
  <h2>title2</h2> text2
  <a href="somelink2">anchor2</a>
</div>
        Something else here
<div class="info">
  <h3>title3</h3> text3
  <a href="somelink3">anchor3</a>
</div>

Upvotes: 1

null

Reputation: 7594

I would like to add to Ciaran McNulty's answer.

You can do the same in SimpleXML:

$simplexml->node->asXML(); // saveXML() is an alias of asXML()

And to expand on the quote

The NodeValue of an element is really the textual value, not the structured XML.

You can think of your node as follows:

<div class="info">
    <__toString()> </__toString()>
    <h5>title</h5>
    <__toString()> text </__toString()>
    <a href="somelink">anchor</a>
    <__toString()> </__toString()>
</div>

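In SimpleXML, casting the element to a string calls $element->__toString(), which returns only the imaginary __toString() elements above, i.e. the text sitting directly inside the div. DOM's $element->nodeValue goes one step further and also includes the text of the child elements (here "title" and "anchor"), but in both cases the tags themselves are dropped. The imaginary __toString() element I created is officially defined as an XML_TEXT_NODE.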
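
To make the difference concrete, here is a minimal sketch that compares the three views of the same node. The `<root>` wrapper and the variable names are assumptions made for illustration; the div itself is the one from the question.

```php
<?php
// Hypothetical wrapper document around the div from the question.
$xml = '<root><div class="info"> <h5>title</h5> text <a href="somelink">anchor</a> </div></root>';

$sx  = simplexml_load_string($xml);
$div = $sx->div;

// Full element including its tags and child elements.
$markup = $div->asXML();

// Only the text nodes that are direct children of <div>.
$ownText = (string) $div;

// The same node seen through DOM: text of all descendants, tags dropped.
$allText = dom_import_simplexml($div)->nodeValue;

echo $markup, "\n";
echo trim($ownText), "\n";
echo trim($allText), "\n";
```

Only asXML() keeps the h5 and a tags; the two text views differ in whether the text inside the child elements is included.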

Upvotes: 1

phihag

Reputation: 287755

div/node() should do the trick.

Example input:

<div class="info">
  some <h5>title</h5> text <a href="somelink">anchor</a> more text
</div>

Example XSLT stylesheet:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
        <newtag>
                <xsl:copy-of select="div/node()"/>
        </newtag>
</xsl:template>

</xsl:stylesheet>

Example output:

<?xml version="1.0" encoding="utf-8"?>
<newtag> some<h5>title</h5> text <a href="somelink">anchor</a> more text</newtag>
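
If you are working in PHP rather than XSLT, the same div/node() expression can be run through DOMXPath. The following is only a sketch under the assumption that the input is well-formed XML and that the div is the root element, as in the example input; `$inner` is a name chosen for illustration.

```php
<?php
// Example input, written on one line to keep the result easy to read.
$xml = '<div class="info">some <h5>title</h5> text <a href="somelink">anchor</a> more text</div>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// node() selects every child of the div: text nodes and elements alike.
$inner = '';
foreach ($xpath->query('/div/node()') as $child) {
    // Serialize each child with its markup and concatenate the pieces.
    $inner .= $dom->saveXML($child);
}

echo $inner, "\n";
```

This yields the contents of the div without the surrounding div tag, which is the PHP counterpart of copying div/node() into a new element.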

Upvotes: 0

Glen Solsberry

Reputation: 12320

You'll need to make sure your xpath query 'ends' at the <div class="info">. However, because of the way XPath works, you'll still get all of the 'subtags' in separate nodes. You'll just need to concatenate them.

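You could also use XPath's join functionality (string-join() in XPath 2.0), though, as I haven't used it, I can't say what problems you might run into. Note that PHP's DOMXPath implements XPath 1.0 only, so that function is not available there.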

Upvotes: 0

Ciaran McNulty

Reputation: 18848

If you have it as a DOMElement $element that is part of a DOMDocument $dom, then you will want to do something like:

$string = $dom->saveXml($element);

The NodeValue of an element is really the textual value, not the structured XML.
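
Putting that together with the XPath query from the question, a minimal sketch might look like the following. The sample markup string, the `$blocks` array, and the exact query are assumptions for illustration; a real scraper would load the fetched page instead.

```php
<?php
// Sample markup based on the question; stands in for the downloaded page.
$html = '<html><body><div class="info"> <h5>title</h5> text '
      . '<a href="somelink">anchor</a> </div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html); // the HTML parser tolerates markup that is not valid XML

$xpath  = new DOMXPath($dom);
$blocks = [];
foreach ($xpath->query('//div[@class="info"]') as $element) {
    // Passing a node to saveXML() serializes that node with all its descendants.
    $blocks[] = $dom->saveXML($element);
}

foreach ($blocks as $block) {
    echo $block, "\n";
}
```

$dom->saveHTML($element) can be used in the same way if HTML serialization is preferred over XML serialization.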

Upvotes: 2
