Reputation: 9574
<?php
$string = '
Some photos<br>
<span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />
<span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />
<span class="naslov_slike">photo_by_ile_IMG_1697-01</span><br />
<span class="naslov_slike">photo_by_ile_IMG_1695-01</span><br />
';
$dom = new DOMDocument();
// preserveWhiteSpace only takes effect if it is set before loading
$dom->preserveWhiteSpace = false;
$dom->loadHTML($string);
$elements = $dom->getElementsByTagName('span');
$spans = array();
foreach($elements as $span) {
$spans[] = $span;
}
foreach($spans as $span) {
$span->parentNode->removeChild($span);
}
echo $dom->saveHTML();
?>
I'm using this code to parse strings. The string returned by saveHTML() has some extra tags added:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Some photos<br><br><br><br><br></p></body></html>
Is there any way to avoid this and get a clean string back? This input string is just an example; in real use it can be any HTML string.
Upvotes: 10
Views: 9118
Reputation: 42697
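Since PHP 5.4, DOMDocument::loadHTML() accepts an options parameter. The two constants used here also need a recent enough libxml: LIBXML_HTML_NOIMPLIED requires Libxml 2.7.7 or later, and LIBXML_HTML_NODEFDTD requires Libxml 2.7.8 or later. With it you can do this: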
$dom = new \DomDocument();
$dom->loadHTML($string, \LIBXML_HTML_NODEFDTD | \LIBXML_HTML_NOIMPLIED);
// do stuff
echo $dom->saveHTML();
We pass two libxml constants: LIBXML_HTML_NODEFDTD says not to add a document type definition, and LIBXML_HTML_NOIMPLIED says not to add implied elements like <html> and <body>.
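One caveat with LIBXML_HTML_NOIMPLIED is that the parser no longer has an implied wrapper to hang the fragment on, so input with several top-level nodes may come back restructured. A possible workaround, sketched below, is to wrap the fragment in a single <div> before loading and then serialize only that wrapper's children. The function name removeSpans and the wrapper <div> are assumptions made for illustration, not part of the answer above, and the sketch assumes ASCII input (loadHTML() does not default to UTF-8).

```php
<?php
// Hypothetical helper: removes every <span> from an HTML fragment and
// returns the remaining markup without doctype, <html> or <body>.
function removeSpans(string $html): string
{
    $dom = new DOMDocument();
    // Wrap the fragment so it has exactly one root element.
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    $root = $dom->documentElement;

    // Copy the live node list first; removing nodes while iterating it skips items.
    foreach (iterator_to_array($root->getElementsByTagName('span')) as $span) {
        $span->parentNode->removeChild($span);
    }

    // Serialize the wrapper's children, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo removeSpans('Some photos<br><span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />');
```

Passing a node to saveHTML() serializes just that node, which is what keeps the wrapper <div> out of the result.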
Upvotes: 21
Reputation: 186562
I'm actually looking for the same solution. I've been using the following method to do this, however the <p> around the text node will still be added when you do loadHTML(). I don't think there's a way to get around that without using another parser, unless there's some hidden flag that tells it not to do that.
This code:
<?php
function innerHTML($node) {
    // Copy each child into a fresh document and serialize that document
    $doc = new DOMDocument();
    foreach ($node->childNodes as $child) {
        $doc->appendChild($doc->importNode($child, true));
    }
    return $doc->saveHTML();
}
$string = '
Some photos<br>
<span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />
<span class="naslov_slike">photo_by_ile_IMG_1699-01</span><br />
<span class="naslov_slike">photo_by_ile_IMG_1697-01</span><br />
<span class="naslov_slike">photo_by_ile_IMG_1695-01</span><br />
';
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($string);
$elements = $dom->getElementsByTagName('span');
$spans = array();
foreach($elements as $span) {
$spans[] = $span;
}
foreach($spans as $span) {
$span->parentNode->removeChild($span);
}
echo innerHTML( $dom->documentElement->firstChild );
Will output:
<p>Some photos<br><br><br><br><br></p>
Of course, this solution does not keep the markup 100% intact (note the added <p>), but it's close.
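A variant that avoids building a second document is to look up <body> by name and serialize its children one at a time. This is only a sketch: the helper name bodyInnerHtml is made up for illustration, and it relies on saveHTML() accepting a node argument (available since PHP 5.3.6). The <p> that the parser adds around the leading text is still present in the result.

```php
<?php
// Hypothetical helper: returns the markup found inside <body>.
function bodyInnerHtml(DOMDocument $dom): string
{
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // nothing was parsed into a body
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // Serialize one child node at a time, so no doctype or wrapper is emitted.
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument();
$dom->loadHTML('Some photos<br><span class="naslov_slike">photo_by_ile_IMG_1676-01</span><br />');
echo bodyInnerHtml($dom);
```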
Upvotes: 11
Reputation: 61
From the manual: http://php.net/manual/en/domdocument.savehtml.php
$html_fragment = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML()));
Works for me.
Upvotes: -1
Reputation: 31
After using loadHTML, you can do this:
# loadHTML causes a !DOCTYPE tag to be added, so remove it:
$dom->removeChild($dom->firstChild);
# it also wraps the code in <html><body></body></html>, so remove that:
$dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild);
The !DOCTYPE tag will be removed, and the first tag inside the body tag will replace the html tag.
Obviously, this will only work if you're only interested in the first tag inside the body, as I was when I encountered this problem. But this example could be adapted to copy everything inside the body with a little bit of effort.
Edit: Meh, nevermind. I like meder's solution.
Upvotes: 3
Reputation: 546035
You could always just use a regex to strip that first bit out:
echo preg_replace("/<!DOCTYPE [^>]+>/", "", $dom->saveHTML());
Upvotes: 0
Reputation: 546035
I'm not sure if either of these will actually work, but you could try using DOMImplementation::createDocument when constructing your DOMDocument: the third argument is the DOCTYPE you wish to use.
Also, instead of saveHTML(), you could try saveXML().
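For the saveXML() idea, one thing that can be said with some confidence is that passing a node to saveXML() serializes only that node, without an XML declaration or doctype. The sketch below applies that to the children of <body>; the variable names are illustrative, and the output is in XML syntax, so it may not be byte-for-byte what an HTML consumer expects.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<p>Some photos<br></p>');

$body = $dom->getElementsByTagName('body')->item(0);

$xml = '';
foreach ($body->childNodes as $child) {
    // saveXML() with a node argument dumps just that node.
    $xml .= $dom->saveXML($child);
}
echo $xml;
```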
Upvotes: -2