Reputation: 193
What I am trying to do is include an HTML file within a PHP system (not a problem), but that HTML file also needs to be usable on its own, for various reasons. So I need to know how I can strip the doctype, html, head and body tags in the context of the PHP include, if that's possible.
I'm not particularly good at PHP (doh!), so my searches of the PHP manual and the web haven't helped me figure this out. Any help or reading tips, or both, are much appreciated.
Upvotes: 19
Views: 34268
Reputation: 5773
Hey, why not answer a 9-year-old question? PHP version 5.4 (released 3 years after this question was asked) added the options parameter to DOMDocument::loadHTML(). With it you can do this:
$dom = new DomDocument();
$dom->loadHTML($string, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
// do stuff
echo $dom->saveHTML();
We pass two constants: LIBXML_HTML_NODEFDTD says not to add a document type definition, and LIBXML_HTML_NOIMPLIED says not to add implied elements like <html> and <body>.
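As a minimal sketch of what those two flags do, the snippet below parses a small fragment and serializes it again. The fragment string is a made-up example, not something from the question. Note that the flags only stop libxml from adding a doctype and wrapper elements; they do not remove a doctype, <html> or <body> that is already present in the input. The sketch also uses a single root element on purpose, since a fragment with several top-level elements may be restructured by the parser.

```php
<?php
// Hypothetical fragment with a single root element (an assumption for this sketch).
$fragment = '<div id="content"><p>Hello, world.</p></div>';

$dom = new DOMDocument();
// Without these flags the parser would add a doctype and wrap the fragment in <html><body>.
$dom->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Serialize the document; no doctype or implied wrapper elements are added.
$output = $dom->saveHTML();
echo $output;
```

If the file on disk is a complete HTML document, this approach alone will not strip its wrapper; it fits best when the input is already a fragment.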
Upvotes: 7
Reputation: 31
A solution with only one instance of DOMDocument and without loops
$d = new DOMDocument();
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
echo $d->saveHTML($body); // note: the output includes the enclosing <body>...</body> element itself
Upvotes: 1
Reputation: 169
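$site = file_get_contents("http://www.google.com/");
// Capture everything between the opening and closing body tags
// (i: case-insensitive, s: "." also matches newlines)
if (preg_match("/<body[^>]*>(.*?)<\/body>/is", $site, $matches)) {
    echo $matches[1];
}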
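The same idea can be wrapped in a small function that works on any HTML string and reports when no body element is found. This is only a sketch: extractBodyByRegex() is a hypothetical helper name, the sample page is made up, and the nullable return type assumes PHP 7.1 or later. A regular expression like this does not understand HTML, so it can be fooled by, for example, the text </body> appearing inside a script or comment.

```php
<?php
/**
 * Return the markup between <body ...> and </body>, or null if no body element is found.
 * extractBodyByRegex() is a hypothetical name used only for this sketch.
 */
function extractBodyByRegex(string $html): ?string
{
    // i: case-insensitive tag names; s: let "." match newlines.
    if (preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

// Made-up sample document with an upper-case BODY tag and an attribute.
$page = "<!DOCTYPE html>\n<html><head><title>T</title></head>\n"
      . "<BODY class=\"x\">\n<p>Hi</p>\n</BODY></html>";

$inner = extractBodyByRegex($page);
echo $inner ?? '';
```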
Upvotes: 15
Reputation:
This may be a solution. I tried it and it works fine.
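function parseHTML(string) {
    // Browser JavaScript (DOMParser), not PHP
    var parser = new DOMParser(),
        result = parser.parseFromString(string, "text/html");
    // html -> body -> first child of body; assumes the input has no doctype,
    // otherwise firstChild is the doctype node rather than <html>
    return result.firstChild.lastChild.firstChild;
}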
Upvotes: -1
Reputation: 4500
You may want to use the PHP tidy extension, which can repair invalid XHTML structures (which can otherwise cause DOMDocument loading to fail) and also extract only the body:
$tidy = new tidy();
$htmlBody = $tidy->repairString($html, array(
'output-xhtml' => true,
'show-body-only' => true,
), 'utf8');
Then load extracted body into DOMDocument:
$xml = new DOMDocument();
$xml->loadHTML($htmlBody);
Then traverse, extract, or move around the XML nodes as needed, and save:
$output = $xml->saveXML();
Upvotes: 2
Reputation: 28099
Use a DOM parser. This is not tested, but it ought to do what you want:
$domDoc = new DOMDocument();
$domDoc->loadHTMLFile('/path/to/file');
$body = $domDoc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
    echo $child->C14N(); // Note: this canonicalizes the representation of the node, but that's not necessarily a bad thing
}
If you want to avoid canonicalization, you can use this version (thanks to @Jared Farrish)
Upvotes: 1
Reputation: 173562
Use DOMDocument to keep what you need rather than strip what you don't need (PHP >= 5.3.6)
$d = new DOMDocument;
$d->loadHTMLFile($fileLocation);
$body = $d->getElementsByTagName('body')->item(0);
// perform innerhtml on $body by enumerating child nodes
// and saving them individually
foreach ($body->childNodes as $childNode) {
echo $d->saveHTML($childNode);
}
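For reuse in an include, the loop above can be folded into a function that returns the body's inner HTML as a string. This is a sketch under a few assumptions: bodyInnerHtml() is a hypothetical helper name, the sample page is made up, and libxml_use_internal_errors() is used only to keep parse warnings from being printed for imperfect markup.

```php
<?php
/**
 * Return the inner HTML of the <body> element of a full HTML document.
 * bodyInnerHtml() is a hypothetical helper name, not part of the answer above.
 */
function bodyInnerHtml(string $html): string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> individually, so the wrapper itself is left out.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Made-up sample document.
$page = '<!DOCTYPE html><html><head><title>Demo</title></head>'
      . '<body><h1>Title</h1><p>Text</p></body></html>';

$result = bodyInnerHtml($page);
echo $result;
```

In the include scenario from the question, the call could look like echo bodyInnerHtml(file_get_contents($path)), where $path is whatever standalone HTML file you want to embed.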
Upvotes: 5
Reputation: 49198
Since the substr() method seemed to be too much for some to swallow, here is a DOM parser method:
$d = new DOMDocument;
$mock = new DOMDocument;
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
$mock->appendChild($mock->importNode($child, true));
}
echo $mock->saveHTML();
Anybody who wishes to see that "other one" can look at the revisions.
Upvotes: 24