stckvrw

Reputation: 1791

How to remove the outer tags from a node in PHP

I have the following HTML code:

$pageHTML = '<html>
<head></head>
<body>
<div class="some class">
<header>Header</header>
<section>Section</section>
<footer>Footer</footer>
</div>
</body>
</html>';

and I need to remove the outer tags of the <div> while keeping all of its inner HTML inside the <body>.

If I try

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($pageHTML);
libxml_use_internal_errors(false);

$bodyDivs = [];
foreach($dom->getElementsByTagName('body')[0]->childNodes as $bodyChild) {
    if($bodyChild->nodeName == 'div') {
        $bodyDivs[] = $bodyChild;
    }
}

if(count($bodyDivs) == 1) {
    foreach($bodyDivs[0]->childNodes as $divChild) {
        $dom->getElementsByTagName('body')[0]->appendChild($divChild);
    }
    $dom->getElementsByTagName('body')[0]->removeChild($bodyDivs[0]);
}

the div is removed, but its children are not appended to <body> before the removal.

If I try a reverse loop such as

$k = count($bodyDivs[0]->childNodes);
for($n = $k-1; $n >= 0; $n--) {
    $dom->getElementsByTagName('body')[0]->appendChild($bodyDivs[0]->childNodes[$n]);
}
$dom->getElementsByTagName('body')[0]->removeChild($bodyDivs[0]);

the children are added to the body, but in reverse order.

So I get

<body>
<footer>Footer</footer>
<section>Section</section>
<header>Header</header>
</body>

but I need

<body>
<header>Header</header>
<section>Section</section>
<footer>Footer</footer>
</body>

How can I resolve this?

Upvotes: 0

Views: 509

Answers (2)

salathe

Reputation: 51950

Your original code is very close, just missing one key point.

Original code

foreach($bodyDivs[0]->childNodes as $divChild) {
    $dom->getElementsByTagName('body')[0]->appendChild($divChild);
}

Trying to foreach a list of nodes, while also removing nodes from that same list (in your case, moving them to the <body>), does not behave as you intended.

Simplified, complete example for demonstration purposes:

<?php
$doc = new DOMDocument;
$doc->loadXML('<example><a/><b/><c/><d/><e/></example>');
$parent = $doc->documentElement;
foreach ($parent->childNodes as $child) {
    $parent->removeChild($child);
}
echo $doc->saveXML();

This outputs the following:

<?xml version="1.0"?>
<example><b/><c/><d/><e/></example>

Totally sensible, right?! Fear not, we can do better.
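
One way to see what is going on is to check that childNodes is a live list: it reflects changes to the tree as soon as they happen. The following is a minimal sketch using a shortened, made-up document; the variable names are only for illustration.

```php
<?php
$doc = new DOMDocument;
$doc->loadXML('<example><a/><b/><c/></example>');
$parent = $doc->documentElement;

// childNodes is a live DOMNodeList, not a copy of the children.
$children = $parent->childNodes;
$before = $children->length; // counted before any change

// Remove one child through the parent, without touching $children directly.
$parent->removeChild($parent->firstChild);
$after = $children->length;  // same list object, read again after the removal

echo "before: $before, after: $after\n"; // prints "before: 3, after: 2"
```

Because the list changes underneath the loop, a foreach over it cannot be relied on to visit every original child.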

What to do?

A common approach, that does behave as intended, is to loop over the list until it is empty.

<?php
$doc = new DOMDocument;
$doc->loadXML('<example><a/><b/><c/><d/><e/></example>');
$parent = $doc->documentElement;
while ($parent->childNodes->length > 0) {
    $child = $parent->childNodes->item(0);
    $parent->removeChild($child);
}
echo $doc->saveXML();
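
Another option, shown here only as a sketch, is to keep the foreach but iterate over a plain array copy of the list. It assumes iterator_to_array() is given the DOMNodeList directly; the $snapshot name is just for illustration.

```php
<?php
$doc = new DOMDocument;
$doc->loadXML('<example><a/><b/><c/><d/><e/></example>');
$parent = $doc->documentElement;

// Copy the live list into an ordinary array first.
// Removing nodes from the tree no longer affects what the loop iterates over.
$snapshot = iterator_to_array($parent->childNodes);

foreach ($snapshot as $child) {
    $parent->removeChild($child);
}

echo $doc->saveXML();
```

The array keeps the children in document order, so the same pattern also works when the nodes are moved elsewhere instead of being removed.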

Applied to your code

All of the above means that your original foreach:

foreach($bodyDivs[0]->childNodes as $divChild) {
    $dom->getElementsByTagName('body')[0]->appendChild($divChild);
}

can be replaced with a while loop:

while ($bodyDivs[0]->childNodes->length > 0) {
    $divChild = $bodyDivs[0]->childNodes->item(0);
    $dom->getElementsByTagName('body')->item(0)->appendChild($divChild);
}

Aside: I used the ->item(0) notation above, as that's more conventional.
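
Putting the pieces together, here is a hedged, self-contained sketch applied to the HTML from the question. The unwrapElement() helper is hypothetical (it is not part of DOMDocument), and it uses insertBefore() rather than appendChild() so the children land exactly where the <div> used to be.

```php
<?php
// Hypothetical helper: replace an element with its own children, keeping their order.
function unwrapElement(DOMElement $element): void
{
    $parent = $element->parentNode;
    if ($parent === null) {
        return; // detached element, nothing to unwrap into
    }
    // firstChild is read again on every pass, so no stale list is involved.
    while ($element->firstChild !== null) {
        // insertBefore() moves the node out of $element and places it in front of it.
        $parent->insertBefore($element->firstChild, $element);
    }
    $parent->removeChild($element);
}

$pageHTML = '<html><head></head><body><div class="some class">'
    . '<header>Header</header><section>Section</section><footer>Footer</footer>'
    . '</div></body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // suppress warnings about HTML5 tags
$dom->loadHTML($pageHTML);
libxml_use_internal_errors(false);

$body = $dom->getElementsByTagName('body')->item(0);
$div = $body->getElementsByTagName('div')->item(0);
unwrapElement($div);

echo $dom->saveHTML($body);
```

With the question's original multi-line markup, the whitespace text nodes inside the <div> are moved along with the elements, so the line breaks are preserved as well.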

Upvotes: 1

stckvrw

Reputation: 1791

OK, I've found my own solution, but maybe someone will post a more elegant one:

if(count($bodyDivs) == 1) {

    $count = count($bodyDivs[0]->childNodes);

    $arr = [];
    for($n = $count-1; $n >= 0; $n--) {
        $arr[] = $bodyDivs[0]->childNodes[$n];
    }

    for($n = $count-1; $n >= 0; $n--) {
        $dom->getElementsByTagName('body')[0]->appendChild($arr[$n]);
    }

    $dom->getElementsByTagName('body')[0]->removeChild($bodyDivs[0]);
}

echo str_replace("\n\r", "", $dom->saveHTML((new \DOMXPath($dom))->query('/')->item(0)));

Upvotes: 0
