Michael Ecklund

Reputation: 1256

Parse HTML and Get All h3's After an h2 Before the Next h2 Using PHP

I want to find the first h2 in the article and then collect every h3 that follows it until the next h2 is reached. Rinse and repeat until all headings and subheadings have been located.

Before you flag or close this as a duplicate parsing question, please take note of the question title: this isn't about basic node retrieval. I've got that part down.

I am parsing the HTML with DOMDocument, using DOMDocument::loadHTML(), DOMDocument::getElementsByTagName() and DOMDocument::saveHTML() to retrieve the important headings of an article.

My code is as follows:

$matches = array();
$dom = new DOMDocument;
$dom->loadHTML($content);
foreach($dom->getElementsByTagName('h2') as $node) {
    $matches['heading-two'][] = $dom->saveHtml($node);
}
foreach($dom->getElementsByTagName('h3') as $node) {
    $matches['heading-three'][] = $dom->saveHtml($node);
}
if($matches){
    $this->key_points = $matches;
}

Which gives me an output of something like:

array(
    'heading-two' => array(
        '<h2>Here is the first heading two</h2>',
        '<h2>Here is the SECOND heading two</h2>'
    ),
    'heading-three' => array(
        '<h3>Here is the first h3</h3>',
        '<h3>Here is the second h3</h3>',
        '<h3>Here is the third h3</h3>',
        '<h3>Here is the fourth h3</h3>',
    )
);

I'm looking to have something more like:

array(
    '<h2>Here is the first heading two</h2>' => array(
        '<h3>Here is an h3 under the first h2</h3>',
        '<h3>Here is another h3 found under first h2, but after the first h3</h3>'
    ),
    '<h2>Here is the SECOND heading two</h2>' => array(
        '<h3>Here is an h3 under the SECOND h2</h3>',
        '<h3>Here is another h3 found under SECOND h2, but after the first h3</h3>'
    )
);

I'm not exactly looking for code completion (if you feel it would better help others by doing so, go ahead), but rather guidance or advice in the right direction to build a nested array like the one directly above.

Upvotes: 6

Views: 3724

Answers (2)

userabuser

Reputation: 443

This would also work: get the line number at which each node was found in the document and store it as the array key. You can then ksort($matches) to put the headings back in the order in which they appear in the HTML document.

$matches = array();
$dom = new DOMDocument;
$dom->loadHTML($content);

foreach($dom->getElementsByTagName('h2') as $node) {
    $matches[$node->getLineNo()] = $dom->saveHtml($node);
}
foreach($dom->getElementsByTagName('h3') as $node) {
    $matches[$node->getLineNo()] = $dom->saveHtml($node);
}

ksort($matches);

...or, a little tighter:

foreach(array('h2', 'h3') as $tag) {
    foreach($dom->getElementsByTagName($tag) as $node) {
        $matches[$node->getLineNo()] = $dom->saveHtml($node);
    }
}

ksort($matches);
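
The sorted array is still flat, so one more pass is needed to get the nested shape the question asks for. Below is a minimal sketch of that pass. The helper name group_headings() and the sample line numbers are made up for illustration, and it assumes every heading sits on its own source line (two headings sharing a line number would share an array key, so one would overwrite the other).

```php
<?php
// Hypothetical helper: folds an ordered, flat list of heading HTML strings
// into array('<h2>...</h2>' => array('<h3>...</h3>', ...)).
function group_headings(array $ordered): array
{
    $grouped = array();
    $current = null; // the h2 we are currently collecting h3's for

    foreach ($ordered as $html) {
        if (strncmp($html, '<h2', 3) === 0) {
            // A new h2 starts a new group.
            $current = $html;
            $grouped[$current] = array();
        } elseif ($current !== null) {
            // An h3 belongs to the most recent h2; h3's before any h2 are skipped.
            $grouped[$current][] = $html;
        }
    }

    return $grouped;
}

// Sample input shaped like the ksort()-ed $matches, keyed by made-up line numbers.
$matches = array(
    3  => '<h2>First</h2>',
    5  => '<h3>First A</h3>',
    9  => '<h2>Second</h2>',
    11 => '<h3>Second A</h3>',
);

$nested = group_headings($matches);
print_r($nested);
```

Note that two h2 elements with identical markup would end up under the same key, which is also true of the array layout requested in the question.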

Upvotes: 2

dev-null-dweller

Reputation: 29472

I assume that all headings are on the same level in the DOM, so every h3 is a sibling of an h2. With that assumption, you can iterate over the siblings of each h2 until the next h2 is encountered:

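foreach($dom->getElementsByTagName('h2') as $node) {
    // Use the h2 markup as the key for its group of h3's
    $key = $dom->saveHTML($node);
    $matches[$key] = array();
    // Walk the following siblings until the next h2 (or the end of the parent)
    while(($node = $node->nextSibling) && $node->nodeName !== 'h2') {
        if($node->nodeName === 'h3') {
            $matches[$key][] = $dom->saveHTML($node);
        }
    }
}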
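
If the sibling assumption does not hold (for example, the h3's are wrapped in a container element), the sibling walk never reaches them. One possible alternative, not part of the approach above, is to let DOMXPath return the h2 and h3 elements in document order and group them in a single pass. This is only a sketch: the sample $content is invented, and the XPath expression is one of several that would work.

```php
<?php
// Hypothetical sample: the h3's live inside <div> wrappers, so they are
// NOT siblings of the h2's.
$content = '<h2>First</h2><div><h3>First A</h3><h3>First B</h3></div>'
         . '<h2>Second</h2><div><h3>Second A</h3></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($content);

$xpath   = new DOMXPath($dom);
$matches = array();
$current = null; // key of the h2 group currently being filled

// Select every h2 and h3 anywhere in the document, in document order.
foreach ($xpath->query('//*[self::h2 or self::h3]') as $node) {
    $html = $dom->saveHTML($node);
    if ($node->nodeName === 'h2') {
        // Each h2 opens a new group.
        $current = $html;
        $matches[$current] = array();
    } elseif ($current !== null) {
        // Each h3 is filed under the most recent h2.
        $matches[$current][] = $html;
    }
}

print_r($matches);
```

An h3 that appears before the first h2 is skipped here; whether that is the right behavior depends on the articles being parsed.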

Upvotes: 11
