user13957687

XPath - How to extract the element following the parent of the H1 tag

I am trying to extract blog posts from web pages. Different pages have different structures, so it is difficult to extract what I need. There is also some CSS and JS code in the HTML, which I have to avoid.

<body>
      <div>
         <h1> a dummy title </h1>
      </div>

            <script> function loadDoc() {const xhttp = new XMLHttpRequest();} </script>

        <div class="subtitle">
            <p>...</p>
        </div>

        <div class="blog-post">
            <p>...</p>
                <div class="clear-fix">...</div>
            <p>...</p>
                <div class="clear-fix">...</div>
            <p>...</p>
            <p>...</p>
       </div>

       <div class="another-section">
            <p>...</p>
            <p>...</p>
       </div>

       <div class="another-another-section">
            <p>...</p>
            <p>...</p>
                <div class="clear-fix">...</div>
            <p>...</p>
            <p>...</p>
            <p>...</p>
       </div>
</body>

What I have tried:
I tried to find the <div> with the most <p> elements, but sometimes other <div>s have just as many <p> elements. I have to rule those out by picking the one nearest to the <h1>.

$html = '[My html above]';
$HTMLDoc = new DOMDocument();
$HTMLDoc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($HTMLDoc);

# locate every div that contains at least one p
$pees = $xpath->query('//div[.//p]');
$pchilds = [];

# get the number of children in each div
# (childElementCount counts all child elements, not only p)
foreach ($pees as $pee) {
    $pchilds[] = $pee->childElementCount;
}

# now find the div(s) with the max number of children
$max = max($pchilds);
foreach ($pees as $pee) {
    if ($pee->childElementCount == $max) {
        echo $pee->nodeValue;
        # or do whatever
    }
}

Upvotes: 0

Views: 884

Answers (3)

urznow

Reputation: 1801

In a comment you added,

want to find the first <h1> then I want to find the most nearest <div> having max <p>. There can be another tags in that <div> but I want to print <p> tags only.

If your PHP processor has EXSLT support, something like this should be possible:

  < file xmlstarlet select --template \
  --var T='//div[p][contains(preceding::h1[1],"my title")]' \
  --copy-of '($T[count(p) = math:max(dyn:map($T,"count(p)"))])[1]/p'

where

  • the T variable selects the div nodes of interest, assuming you know, or can extract, the h1 section header text
  • dyn:map maps each div to the count of its p children, math:max picks the maximum count
  • ($T[…])[1]/p selects the p children of the first of possibly more divs with a maximum p count

The command above uses xmlstarlet syntax; to make a single XPath expression, replace $T (in both places) with the contents of T inside parentheses. It executes the following XSLT stylesheet (add -C before --template to list it):

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:math="http://exslt.org/math" xmlns:dyn="http://exslt.org/dynamic" version="1.0" extension-element-prefixes="math dyn">
  <xsl:output omit-xml-declaration="yes" indent="no"/>
  <xsl:template match="/">
    <xsl:variable select="//div[p][contains(preceding::h1[1],&quot;my title&quot;)]" name="T"/>
    <xsl:copy-of select="($T[count(p) = math:max(dyn:map($T,&quot;count(p)&quot;))])[1]/p"/>
  </xsl:template>
</xsl:stylesheet>
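
If EXSLT is not available, the same selection can probably be done in plain PHP by letting DOMXPath pick the candidate divs and computing the maximum in a loop. This is only a minimal sketch: the sample markup, the "my title" text, and the variable names ($candidates, $best, $paragraphs) are assumptions made for illustration.

```php
<?php
// Hypothetical sample markup; the title text and class names are assumptions.
$html = <<<HTML
<body>
  <div><h1>my title</h1></div>
  <div class="subtitle"><p>s1</p></div>
  <div class="blog-post"><p>a</p><div class="clear-fix">x</div><p>b</p><p>c</p></div>
  <div class="another-section"><p>d</p><p>e</p></div>
</body>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Same role as the T variable: divs with p children whose nearest
// preceding h1 contains the known title text.
$candidates = $xpath->query('//div[p][contains(preceding::h1[1], "my title")]');

$best = null;
$bestCount = 0;
foreach ($candidates as $div) {
    $count = (int) $xpath->evaluate('count(p)', $div);
    if ($count > $bestCount) { // strict ">" keeps the first div on ties
        $best = $div;
        $bestCount = $count;
    }
}

// Collect only the p children of the winning div, skipping other tags.
$paragraphs = [];
if ($best !== null) {
    foreach ($xpath->query('p', $best) as $p) {
        $paragraphs[] = $doc->saveHTML($p);
    }
}
echo implode("\n", $paragraphs), "\n";
```

The loop replaces math:max and dyn:map, so it only needs XPath 1.0, which is what DOMXPath supports.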

Upvotes: 0

E.Wiest

Reputation: 5905

XPath 2.0 solution (see SaxonC for PHP support). Find the first (nearest) div after <h1> that contains the maximum number of <p> elements:

//h1/following-sibling::div[p[max(//h1/following-sibling::div/count(p))]][1]

Output:

<div class="blog-post">
            <p>...</p>
                <div class="clear-fix">...</div>
            <p>...</p>
                <div class="clear-fix">...</div>
            <p>...</p>
            <p>...</p>
       </div>

XPath 1.0 approximate solution (could return the wrong div):

//h1/following-sibling::div[count(./p)>1][count(./p)>count(./preceding-sibling::div[./p][1]/p)][count(./p)>count(./following-sibling::div[./p][1]/p)][1]
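
As a rough illustration, the XPath 1.0 expression can be evaluated with PHP's DOMXPath. The markup below is a hypothetical sample in which the <h1> shares a parent with the divs, because the following-sibling axis only sees siblings of the <h1>. The variable names are likewise assumptions.

```php
<?php
// Hypothetical markup: the h1 is a direct sibling of the candidate divs.
$html = <<<HTML
<body>
  <h1>a dummy title</h1>
  <div class="subtitle"><p>s</p></div>
  <div class="blog-post"><p>a</p><p>b</p><p>c</p><p>d</p></div>
  <div class="another-section"><p>e</p><p>f</p></div>
</body>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The expression from above, split across lines for readability:
// keep divs with more than one p that also beat their nearest
// p-bearing neighbours on both sides, then take the first.
$expr = '//h1/following-sibling::div[count(./p)>1]'
      . '[count(./p)>count(./preceding-sibling::div[./p][1]/p)]'
      . '[count(./p)>count(./following-sibling::div[./p][1]/p)][1]';

$matches = $xpath->query($expr);
$class = $matches->length > 0 ? $matches->item(0)->getAttribute('class') : null;
echo $class === null ? "no match\n" : $class . "\n";
```

Because each div is compared only with its immediate neighbours, this finds a local maximum rather than the global one, which is why it can return the wrong div on other pages.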

Upvotes: 0

LMC

Reputation: 12712

Find all divs that follow an h1, count the p elements inside each, and finally take the first one with the max() count:

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// Find all divs that are following siblings of an h1
$divs = $xpath->query('//h1/following-sibling::div');

// Count the `p` descendants of each div
$dcnt = array();
foreach ($divs as $idx => $d) {
    $dcnt[$idx] = (int) $xpath->evaluate('count(.//p)', $d);
}

// Show the content of the first div with the max() count
// (max() is computed once, and an empty result is guarded against)
$max = $dcnt ? max($dcnt) : 0;
foreach ($divs as $idx => $d) {
    if ($dcnt[$idx] == $max) {
        print $idx . ': ' . $d->nodeName . ': ' . $d->nodeValue;
        break;
    }
}

Upvotes: 1
