I am trying to extract blog posts from web pages. Different pages have different structures, so it is very difficult to extract what I need. There is also some CSS and JS code in the HTML, which I have to avoid. I already have the blog title (<h1> a dummy title </h1>) from a previous step, so it can help to validate the right section.
<body>
<div>
<h1> a dummy title </h1>
</div>
<script> function loadDoc() {const xhttp = new XMLHttpRequest();} </script>
<div class="subtitle">
<p>...</p>
</div>
<div class="blog-post">
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<p>...</p>
</div>
<div class="another-section">
<p>...</p>
<p>...</p>
</div>
<div class="another-another-section">
<p>...</p>
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<p>...</p>
<p>...</p>
</div>
</body>
What I have tried:
I have tried to find the <div> with the maximum number of <p> elements, but sometimes another <div> has the maximum number of <p> elements. I have to avoid those by finding the one nearest to the <h1>.
$html = '[My html above]';
$HTMLDoc = new DOMDocument();
$HTMLDoc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($HTMLDoc);
// Locate every div that contains at least one p
$divs = $xpath->query('//div[.//p]');
// Count the p children of each div (childElementCount would count every child element)
$counts = [];
foreach ($divs as $i => $div) {
    $counts[$i] = (int) $xpath->evaluate('count(p)', $div);
}
// Now find the div(s) with the max number of p children
$max = $counts ? max($counts) : 0;
foreach ($divs as $i => $div) {
    if ($counts[$i] === $max) {
        echo $div->nodeValue;
        // or do whatever
    }
}
Upvotes: 0
Views: 884
Reputation: 1801
In a comment you added,
want to find the first <h1> then I want to find the most nearest <div> having max <p>. There can be another tags in that <div> but I want to print <p> tags only.
If your PHP processor has EXSLT support, something like this should be possible:
< file xmlstarlet select --template \
--var T='//div[p][contains(preceding::h1[1],"my title")]' \
--copy-of '($T[count(p) = math:max(dyn:map($T,"count(p)"))])[1]/p'
where
- the T variable selects the div nodes of interest, assuming you know, or can extract, the h1 section header text
- dyn:map maps each div to the count of its p children
- math:max picks the maximum count
- ($T[…])[1]/p selects the p children of the first of possibly more divs with a maximum p count

The command above uses xmlstarlet syntax; to make a single XPath expression, replace $T (2 places) with the contents of T inside parentheses.
It executes the following XSLT stylesheet (add -C before --template to list it):
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:math="http://exslt.org/math" xmlns:dyn="http://exslt.org/dynamic" version="1.0" extension-element-prefixes="math dyn">
<xsl:output omit-xml-declaration="yes" indent="no"/>
<xsl:template match="/">
<xsl:variable select="//div[p][contains(preceding::h1[1],'my title')]" name="T"/>
<xsl:copy-of select="($T[count(p) = math:max(dyn:map($T,'count(p)'))])[1]/p"/>
</xsl:template>
</xsl:stylesheet>
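If EXSLT turns out not to be available, the same selection can probably be approximated with plain DOMXPath by taking the maximum in PHP instead of with math:max. The following is only a minimal sketch under that assumption: the helper name pOfFirstMaxDiv and the sample markup are made up for illustration, and the title is assumed not to contain a double quote.

```php
<?php
// Hypothetical helper (not from the answer): mirrors the T / max / [1]/p steps,
// with the maximum computed in PHP rather than by math:max.
function pOfFirstMaxDiv(string $html, string $title): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep libxml quiet about sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // T: divs with p children whose nearest preceding h1 contains the title.
    // Assumption: $title contains no double quote.
    $T = $xpath->query('//div[p][contains(preceding::h1[1], "' . $title . '")]');

    $best = null;
    $bestCount = 0;
    foreach ($T as $div) {
        $count = (int) $xpath->evaluate('count(p)', $div);
        if ($count > $bestCount) {      // strict ">" keeps the first div on ties
            $best = $div;
            $bestCount = $count;
        }
    }
    if ($best === null) {
        return [];
    }

    // Collect only the p children, as requested in the comment.
    $out = [];
    foreach ($xpath->query('p', $best) as $p) {
        $out[] = trim($p->textContent);
    }
    return $out;
}

// Made-up sample markup: two divs tie with three p children each.
$html = '<body><div><h1> my title </h1></div>'
      . '<div class="a"><p>a1</p></div>'
      . '<div class="b"><p>b1</p><div>x</div><p>b2</p><p>b3</p></div>'
      . '<div class="c"><p>c1</p><p>c2</p><p>c3</p></div></body>';

print_r(pOfFirstMaxDiv($html, 'my title'));
```

Because the comparison is strict, the first div in document order wins a tie, which matches the [1] in the expression above.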
Upvotes: 0
Reputation: 5905
XPath 2.0 solution (see SaxonC for PHP support). Find the first div after the <h1> that contains the maximum number of <p> elements:
//h1/following-sibling::div[p[max(//h1/following-sibling::div/count(p))]][1]
Output:
<div class="blog-post">
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<p>...</p>
</div>
XPath 1.0 approximate solution (could return the wrong div):
//h1/following-sibling::div[count(./p)>1][count(./p)>count(./preceding-sibling::div[./p][1]/p)][count(./p)>count(./following-sibling::div[./p][1]/p)][1]
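With PHP's DOMXPath, which evaluates XPath 1.0, the max() part of the 2.0 expression can be emulated in two steps: compute the maximum in PHP, then feed it into the positional predicate p[N]. This is a rough sketch only; the sample markup is invented, and it assumes the h1 is a direct sibling of the candidate divs, as the following-sibling axis requires.

```php
<?php
// Sketch: two-step emulation of the XPath 2.0 max() expression with DOMXPath.
// Made-up sample in which the h1 and the divs are siblings.
$xml = '<body><h1>title</h1>'
     . '<div class="subtitle"><p>s1</p></div>'
     . '<div class="blog-post"><p>b1</p><div class="clear-fix"/><p>b2</p><p>b3</p></div>'
     . '<div class="other"><p>o1</p><p>o2</p></div></body>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Step 1: the inner max(//h1/following-sibling::div/count(p)), done in PHP.
$max = 0;
foreach ($xpath->query('//h1/following-sibling::div') as $div) {
    $max = max($max, (int) $xpath->evaluate('count(p)', $div));
}

// Step 2: p[N] is a positional test, so div[p[N]] keeps only divs that have
// an N-th p child, i.e. at least N p children.
$hit = $max > 0
    ? $xpath->query(sprintf('//h1/following-sibling::div[p[%d]][1]', $max))->item(0)
    : null;

echo $hit ? $hit->getAttribute('class') : 'no match', "\n";
```

If the h1 is wrapped in its own div, following-sibling selects nothing and the sketch prints "no match"; the following axis may be a better fit in that case.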
Upvotes: 0
Reputation: 12712
Find all divs with p elements, count the p elements inside each, and finally take the first one with the max() count:
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
$dcnt = array();
// Find all divs that are following siblings of an h1
$divs = $xpath->query('//h1/following-sibling::div');
// Count the `p` descendants of each div
foreach ($divs as $idx => $d) {
    $dcnt[$idx] = (int) $xpath->evaluate('count(.//p)', $d);
}
// Show the content of the first div with the max() count
$max = $dcnt ? max($dcnt) : 0;   // max() must not be called on an empty array
foreach ($divs as $idx => $d) {
    if ($dcnt[$idx] == $max) {
        print $idx . ': ' . $d->nodeName . ': ' . $d->nodeValue;
        break;
    }
}
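Since nodeValue concatenates the text of every descendant, any <script> or <style> inside the chosen div would be printed as well, which the question wants to avoid. One possible way around that is to remove those elements before reading the text. The sketch below is an illustration only; the helper name stripScriptsAndStyles and the sample markup are assumptions, not part of the answer.

```php
<?php
// Hypothetical helper: drop <script> and <style> elements so their code
// does not end up in nodeValue / textContent.
function stripScriptsAndStyles(DOMDocument $doc): void
{
    $xpath = new DOMXPath($doc);
    // Copy to an array so node removal cannot disturb the iteration.
    $nodes = iterator_to_array($xpath->query('//script | //style'));
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
}

// Made-up sample: a post with inline JS and CSS between its paragraphs.
$html = '<body><div class="blog-post"><p>Hello</p>'
      . '<script>function loadDoc() {}</script>'
      . '<style>p { color: red; }</style><p>World</p></div></body>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid
$doc->loadHTML($html);
libxml_clear_errors();

stripScriptsAndStyles($doc);

$div = (new DOMXPath($doc))->query('//div[@class="blog-post"]')->item(0);
echo $div->textContent, "\n";       // text of the div without the JS and CSS
```

Calling the helper once, right after loading the document, keeps the later counting and printing code unchanged.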
Upvotes: 1