Reputation: 4151
I need help doing a few things with XPath in PHP.
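With any given HTML, I need to:
• Remove all tables and their contents
• Remove everything after the first h1 tag
• Keep only paragraphs (INCLUDING their inner HTML (links, lists, etc))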
With regex, I got everything working perfectly. When I encountered nested tables, however, I decided that it is indeed foolish to parse HTML with regex.
Thanks so much!
Upvotes: 1
Views: 252
Reputation: 243479
With any given HTML, I need to:
• Remove all tables and their contents
• Remove everything after the first h1 tag
• Keep only paragraphs (INCLUDING their inner HTML (links, lists, etc))
This can be done very easily with XSLT:
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Copy every node except when overridden
       by another template -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- Remove all tables and their contents.
       The explicit priority is needed: the default priority of
       "h:table" (0) is lower than that of the predicate patterns
       below (0.5), which would otherwise win for table elements. -->
  <xsl:template match="h:table" priority="2"/>

  <!-- Remove everything after the first h1.
       The explicit priority avoids a conflict with the last template. -->
  <xsl:template match="node()[preceding::h:h1]" priority="1"/>

  <!-- Keep only paragraphs (INCLUDING
       their inner HTML (links, lists, etc))
  -->
  <xsl:template match=
   "node()[not(self::h:p) and not(ancestor::h:p)]">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
If your element names are not in the XHTML namespace, simply delete every occurrence of the h: prefix in the above code.
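Since the question is about PHP, here is a minimal sketch of how a stylesheet like this could be run from PHP using the bundled DOM and XSL extensions. It is an illustration under some assumptions, not part of the original answer: the function name keepParagraphsWithXslt is hypothetical, the stylesheet is the no-prefix variant (DOMDocument::loadHTML produces elements that are in no namespace), and explicit priority values are set so that the table rule and the h1 rule win over the more general predicate pattern.

```php
<?php
// Requires the dom and xsl extensions.

function keepParagraphsWithXslt(string $html): string
{
    // No-namespace variant of the stylesheet (no h: prefix).
    $xslt = <<<'XSL'
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="table" priority="2"/>
  <xsl:template match="node()[preceding::h1]" priority="1"/>
  <xsl:template match="node()[not(self::p) and not(ancestor::p)]">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
XSL;

    // Parse the (possibly sloppy) HTML; collect parser warnings
    // internally instead of emitting them.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Load the stylesheet and run the transformation.
    $style = new DOMDocument();
    $style->loadXML($xslt);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($style);

    return (string) $proc->transformToXml($doc);
}

// Example call with a small, made-up input.
echo keepParagraphsWithXslt(
    '<html><body><p>Intro <a href="#">link</a></p>'
    . '<table><tr><td><p>cell</p></td></tr></table>'
    . '<h1>Title</h1><p>after</p></body></html>'
);
```

Note that the result is a sequence of p elements without a single root, so it is best treated as an HTML fragment rather than as a complete document.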
Upvotes: 1
Reputation: 10847
Consider using an HTML DOM parser, as this will be much easier than treating the HTML as XML. Some parsers support XPath expressions as well. The tricky part is that not all HTML conforms to strict XHTML standards, so the rules are not always easy to apply. Here are a couple of parsers I came across. Some support XPath and some have other ways of selecting content:
http://simplehtmldom.sourceforge.net/
http://php.net/manual/en/simplexmlelement.xpath.php
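As a rough illustration of the DOM-parser route, the sketch below uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than either of the libraries linked above. The function name keepLeadingParagraphs and the sample input are hypothetical. It assumes the three requirements from the question can be folded into a single XPath expression: select paragraphs that are not inside a table and that have no h1 before them in document order.

```php
<?php
// Requires the dom extension.

function keepLeadingParagraphs(string $html): string
{
    // The HTML parser tolerates markup that is not well-formed XML;
    // collect its warnings internally instead of emitting them.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Paragraphs outside any table and not preceded by an h1.
    $query = '//p[not(ancestor::table) and not(preceding::h1)]';

    $out = '';
    foreach ($xpath->query($query) as $p) {
        // saveHTML($node) serializes the paragraph together with
        // its inner markup (links, inline formatting, etc).
        $out .= $doc->saveHTML($p) . "\n";
    }

    return $out;
}

// Example call with a small, made-up input.
echo keepLeadingParagraphs(
    '<html><body><p>Intro <a href="#">link</a></p>'
    . '<table><tr><td><p>cell</p></td></tr></table>'
    . '<h1>Title</h1><p>after</p></body></html>'
);
```

Selecting the nodes to keep, instead of deleting everything else, avoids modifying the tree while iterating over it. Be aware that loadHTML assumes ISO-8859-1 unless the markup declares another encoding.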
Upvotes: 0