Reputation: 9574

How to avoid DOM parsing adding html doctype, <head> and <body> tags?

<?php
    $string = '
    Some photos<br>
    <span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1697-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1695-01</span><br />    
    ';

    $dom = new DOMDocument();
    $dom->loadHTML($string);
    $dom->preserveWhiteSpace = false;
    $elements = $dom->getElementsByTagName('span');
    $spans = array();
    foreach($elements as $span) {
        $spans[] = $span;
    }
    foreach($spans as $span) {
        $span->parentNode->removeChild($span);
    }
    echo $dom->saveHTML();


?>

I'm using this code to parse strings. The string it returns has some extra tags added:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Some photos<br><br><br><br><br></p></body></html>

Is there any way to avoid this and get a clean string back? The input string here is just an example; in practice it can be any HTML string.

Upvotes: 10

Answers (6)

miken32

Reputation: 42697

PHP versions since 5.4, when compiled with Libxml 2.6.0 or later, can use the options parameter of DOMDocument::loadHTML() (the two constants used below additionally need Libxml 2.7.7 and 2.7.8 respectively). With it you can do this:

$dom = new \DomDocument();
$dom->loadHTML($string, \LIBXML_HTML_NODEFDTD | \LIBXML_HTML_NOIMPLIED);
// do stuff
echo $dom->saveHTML();

We pass two libxml constants: LIBXML_HTML_NODEFDTD says not to add a document type definition, and LIBXML_HTML_NOIMPLIED says not to add implied elements like <html> and <body>.
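As a rough sketch of how these flags might be applied to the question's input: LIBXML_HTML_NOIMPLIED is often reported to rearrange nodes when the fragment has no single root element, so the example below wraps the input in a temporary <div> and serializes only that wrapper's children. The function name removeSpansFromFragment and the wrapper are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: remove every <span> from an HTML fragment and return
// the remaining markup without a doctype, <html>, or <body>.
function removeSpansFromFragment(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // then restore whatever setting was active before.
    $previous = libxml_use_internal_errors(true);

    // Assumption: a temporary <div> gives the fragment exactly one root element.
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list to an array first, so that removing nodes
    // does not disturb the iteration.
    foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
        $span->parentNode->removeChild($span);
    }

    // Serialize only the wrapper's children, which drops the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Calling removeSpansFromFragment($string) with the string from the question should return the text and <br> tags only, with no doctype or wrapper tags.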

Upvotes: 21

meder omuraliev

Reputation: 186562

I'm actually looking for the same solution. I've been using the following method, but the <p> around the text node is still added when you call loadHTML(). I don't think there's a way around that without using another parser, unless there's some hidden flag that tells it not to do that.

This code:

<?php

function innerHTML($node){
  $doc = new DOMDocument();
  foreach ($node->childNodes as $child)
    $doc->appendChild($doc->importNode($child, true));

  return $doc->saveHTML();
}

 $string = '
    Some photos<br>
    <span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1697-01</span><br />
    <span class="naslov_slike">photo_by_ile_IMG_1695-01</span><br />    
    ';

    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($string);
    $elements = $dom->getElementsByTagName('span');
    $spans = array();
    foreach($elements as $span) {
        $spans[] = $span;
    }
    foreach($spans as $span) {
        $span->parentNode->removeChild($span);
    }

    echo innerHTML( $dom->documentElement->firstChild );

Will output:

<p>Some photos<br><br><br><br><br></p>

Of course, this solution does not keep the markup 100% intact, but it's close.
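A possible variant of the helper above, sketched under the assumption that PHP 5.3.6 or later is available (where saveHTML() accepts a node argument): each child is serialized in place rather than imported into a second DOMDocument. The name innerHtmlOf is hypothetical, and the function assumes the node belongs to a document.

```php
<?php
// Hypothetical variant of innerHTML(): serialize each child directly
// instead of copying it into a new DOMDocument.
function innerHtmlOf(DOMElement $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument outputs only that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Usage sketch: print everything inside the implied <body>.
$dom = new DOMDocument();
$dom->loadHTML('<p>Some photos<br><span class="naslov_slike">x</span></p>');
$body = $dom->getElementsByTagName('body')->item(0);
echo innerHtmlOf($body);
```

Passing the <body> element rather than its first child returns all top-level nodes, not just the first one.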

Upvotes: 11

StevenFlecha

Reputation: 61

From the manual: http://php.net/manual/en/domdocument.savehtml.php

$html_fragment = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML()));

Works for me.
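The one-liner above could be wrapped in a small function, roughly as follows. This is only a sketch: the name stripDocumentWrapper is made up, the patterns are case-insensitive and tolerate attributes on the wrapper tags, and a <head> element, if present, would be left untouched.

```php
<?php
// Hypothetical helper: strip the doctype and the <html>/<body> wrapper tags
// from the string returned by saveHTML().
function stripDocumentWrapper(string $html): string
{
    // Remove a leading doctype declaration, if any.
    $html = preg_replace('~^\s*<!DOCTYPE[^>]*>\s*~i', '', $html);

    // Remove opening and closing <html> and <body> tags, with or without attributes.
    $html = preg_replace('~</?(?:html|body)\b[^>]*>~i', '', $html);

    return trim($html);
}
```

It would be used as echo stripDocumentWrapper($dom->saveHTML());. Being plain string manipulation, it does not verify that the tags it removes are really the outermost ones.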

Upvotes: -1

Don

Reputation: 31

After using loadHTML, you can do this:

# loadHTML causes a !DOCTYPE tag to be added, so remove it:
$dom->removeChild($dom->firstChild);

# it also wraps the code in <html><body></body></html>, so remove that:
$dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild);

The !DOCTYPE tag will be removed, and the first tag inside the body tag will replace the html tag.

Obviously, this will only work if you're only interested in the first tag inside the body, as I was when I encountered this problem. But this example could be adapted to copy everything inside the body with a little bit of effort.

Edit: Meh, nevermind. I like meder's solution.

Upvotes: 3

nickf

Reputation: 546035

You could always just use a regex to strip that first bit out:

echo preg_replace("/<!DOCTYPE [^>]+>/", "", $dom->saveHTML());

Upvotes: 0

nickf

Reputation: 546035

I'm not sure if either of these will actually work, but you could try using DOMImplementation::createDocument when constructing your DOMDocument - the third argument is the DOCTYPE you wish to use.

Also, instead of saveHTML(), you could try saveXML()

Upvotes: -2
