Zoredache

Reputation: 39593

Loop over DOMDocument

I am following the suggestion from the question "Robust, Mature HTML Parser for PHP" to parse HTML that may be malformed with DOMDocument.

Is there an easy way to loop over the parsed document? I would like to loop over HTML like this:

$html='<ul>
         <li>value1</li>
         <li>value1</li>
         <li>value3
            <p>subvalue</p>
         </li>
        </ul>
        <p>hello world</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
???
foreach (??? as $node)
{
  print $node->nodeName.':'.$node->nodeValue;
}

And get results somewhat like this.

 ul:
 li:value1
 li:value1
 li:value3
 p:subvalue
 p:hello world

Using $doc->childNodes by itself doesn't do what I want, since it doesn't descend into the lower branches of the tree. I used the code suggested by halfdan and got results like this:

html:
html:value1
         value1
         value3
            subvalue

        hello world

Upvotes: 27

Views: 33786

Answers (5)

Eugene Kaurov

Reputation: 2991

If you only need to loop over the elements with a particular tag name, use getElementsByTagName():

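$doc = new DOMDocument;
// $a holds the markup string. loadXML() requires well-formed XML;
// for HTML that may be malformed, use loadHTML() instead.
$doc->loadXML($a);
$nodes = $doc->getElementsByTagName("tr");
$xml = "";
foreach ($nodes as $node) {
    // the content of individual <td> tags can be extracted here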
    $xml .= $doc->saveXML($node);
}
var_dump(htmlentities($xml));
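
The same method also accepts the wildcard "*", which matches every element, so it can be used to visit all elements in document order without writing a recursive function. The following is a minimal sketch under that assumption; list_elements() is a hypothetical helper name chosen for illustration, and loadHTML() is used instead of loadXML() because the question is about HTML that may be malformed.

```php
<?php
// Sketch: visit every element in document order with the '*' wildcard.
// list_elements() is a hypothetical helper, not part of the DOM extension.
function list_elements(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them
    // when the markup is malformed.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $names = [];
    foreach ($doc->getElementsByTagName('*') as $element) {
        $names[] = $element->nodeName;
    }
    return $names;
}

$html = '<ul><li>value1</li><li>value3<p>subvalue</p></li></ul><p>hello world</p>';
print implode(',', list_elements($html)) . "\n";
```

Note that loadHTML() wraps a fragment in html and body elements, so those two names appear in the list as well.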

Upvotes: 0

Alexis Wilke

Reputation: 20731

One way is to walk the tree as follows:

function next_node($node)
{
    if($node->firstChild != null)
    {
        return $node->firstChild;
    }

    if($node->nextSibling != null)
    {
        return $node->nextSibling;
    }

    for($node = $node->parentNode; $node != null; $node = $node->parentNode)
    {
        if($node->nextSibling != null)
        {
            return $node->nextSibling;
        }
    }

    return null;
}

for($node = $doc; $node != null; $node = next_node($node))
{
    // Handle the node here. This loop is read-only: if you need to
    // modify the tree, first save all the nodes in an array and then
    // work from that array.
    ...
}

This works for most documents. At times, however, parentNode does not seem to be set correctly, and next_node() ends up returning the wrong node.
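
The comment in the loop suggests saving the nodes in an array before modifying anything. A minimal sketch of that idea follows. The names next_node_within() and collect_nodes() are hypothetical, and the walk is additionally bounded by a $root argument (an assumption added here, not part of the code above) so that it stops at the end of a subtree instead of climbing past it. Whether this has anything to do with the wrong results mentioned above is not known.

```php
<?php
// Hypothetical variant of next_node() that never leaves the subtree under $root.
function next_node_within(DOMNode $node, DOMNode $root): ?DOMNode
{
    if ($node->firstChild !== null) {
        return $node->firstChild;
    }
    // Climb until a node with a next sibling is found, stopping at $root.
    while ($node !== null && !$node->isSameNode($root)) {
        if ($node->nextSibling !== null) {
            return $node->nextSibling;
        }
        $node = $node->parentNode;
    }
    return null;
}

// Snapshot the nodes first so that later edits cannot disturb the walk.
function collect_nodes(DOMNode $root): array
{
    $nodes = [];
    for ($node = $root; $node !== null; $node = next_node_within($node, $root)) {
        $nodes[] = $node;
    }
    return $nodes;
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>a</li><li>b</li></ul><p>c</p>');
$ul = $doc->getElementsByTagName('ul')->item(0);

// Only the nodes inside <ul> are listed; the following <p> is not reached.
foreach (collect_nodes($ul) as $node) {
    print $node->nodeName . "\n";
}
```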

Upvotes: 2

halfdan

Reputation: 34214

Try this:

$doc = new DOMDocument();
$doc->loadHTML($html);
showDOMNode($doc);

function showDOMNode(DOMNode $domNode) {
    foreach ($domNode->childNodes as $node)
    {
        print $node->nodeName.':'.$node->nodeValue;
        if($node->hasChildNodes()) {
            showDOMNode($node);
        }
    }    
}
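
This prints every node, including text nodes, and nodeValue on an element returns the text of all of its descendants, which is why the output in the question repeats the values. A possible variation, sketched below, reports only element nodes and joins just the text nodes that are direct children of each element. element_lines() is a hypothetical name, and starting from the body element is an assumption made to skip the html and body wrappers that loadHTML() adds.

```php
<?php
// Hypothetical helper: collect "name:text" lines for element nodes only.
function element_lines(DOMNode $parent, array &$lines = []): array
{
    foreach ($parent->childNodes as $node) {
        if ($node->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip text, comment and doctype nodes
        }
        // Join only the text nodes that are direct children of this element.
        $text = '';
        foreach ($node->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $text .= $child->nodeValue;
            }
        }
        $lines[] = $node->nodeName . ':' . trim($text);
        // Recurse so that nested elements follow their parent.
        element_lines($node, $lines);
    }
    return $lines;
}

$html = '<ul>
         <li>value1</li>
         <li>value1</li>
         <li>value3
            <p>subvalue</p>
         </li>
        </ul>
        <p>hello world</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$body = $doc->getElementsByTagName('body')->item(0);

print implode("\n", element_lines($body)) . "\n";
```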

Upvotes: 46

JustAC0der

Reputation: 3149

You can use the PHP Simple HTML DOM Parser with the following code:

<?php
require_once 'simplehtmldom/simple_html_dom.php';

function iterateHtmlElements($html)
{
    $dom = str_get_html($html);
    $dom->set_callback('handleElement');
    $dom->__toString();
    echo "\n";
}

function handleElement(simple_html_dom_node $elem)
{
    if($elem->tag == 'text') {
        echo $elem->innertext();
    }
    else {
        echo "\n" . $elem->tag . ": ";
    }
}

$html='<ul>
         <li>value1</li>
         <li>value1</li>
         <li>value3
            <p>subvalue</p>
         </li>
        </ul>
        <p>hello world</p>';
iterateHtmlElements($html);

It works exactly as expected. I checked it with the input you provided and got the following results:

> php test2.php

ul:
li: value1
li: value1
li: value3
p: subvalue
p: hello world

Upvotes: 1

Drunken Peacock

Reputation: 45

I was having issues with elements that contained CDATA: even elements that didn't have children were reporting that they did.

I am not sure why that was.

The workaround I found was to change

if($node->hasChildNodes()) {
    showDOMNode($node);
}

to

if($node->childNodes->length != 1) {
    showDOMNode($node);
}

And the code now works perfectly.
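
For what it is worth, in the DOM a text or CDATA section is itself a child node, so hasChildNodes() returns true for an element that contains only text. An alternative to comparing the length with 1 is to check explicitly for element children. The sketch below assumes that this is the distinction that matters here; has_element_children() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: true only if $node has at least one element child.
function has_element_children(DOMNode $node): bool
{
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            return true;
        }
    }
    return false;
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>value1</li></ul>');
$li = $doc->getElementsByTagName('li')->item(0);

var_dump($li->hasChildNodes());       // true: the text node "value1" counts as a child
var_dump(has_element_children($li));  // false: there are no element children
```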

Upvotes: 2
