enrico pax

Reputation: 193

Get contents of BODY without DOCTYPE, HTML, HEAD and BODY tags

What I am trying to do is include an HTML file within a PHP system (not a problem), but that HTML file also needs to be usable on its own, for various reasons. So I need to know how I can strip the doctype, html, head and body tags in the context of the PHP include, if that's possible.

I'm not particularly good at PHP (doh!), so my searches of the PHP manual and the web haven't helped me figure this out. Any help or reading tips, or both, are much appreciated.

Upvotes: 19

Views: 34268

Answers (8)

insign

Reputation: 5773

As miken32 said:

Hey, why not answer a 9-year-old question? PHP version 5.4 (released 3 years after this question was asked) added the options parameter to DOMDocument::loadHTML(). With it you can do this:

$dom = new DomDocument();
$dom->loadHTML($string, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

// do stuff

echo $dom->saveHTML();

We pass two constants: LIBXML_HTML_NODEFDTD says not to add a document type definition, and LIBXML_HTML_NOIMPLIED says not to add implied elements like <html> and <body>.
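
As a minimal sketch of the two flags in action, the snippet below parses a fragment and checks that no wrapper is added. The `$fragment` string is a made-up example, and the sketch assumes the input has a single root element. Note that the flags only stop the parser from adding a doctype and implied elements; tags that are already present in the input are kept.

```php
<?php
// Hypothetical fragment with a single root element (an assumption of this sketch).
$fragment = '<div id="content"><p>Hello</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($fragment, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

// With both flags set, no doctype, <html> or <body> wrapper should be added.
$output = $dom->saveHTML();
echo $output;
```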

Upvotes: 7

Luca Vizzi

Reputation: 31

A solution with only one instance of DOMDocument and without loops

$d = new DOMDocument();
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
echo $d->saveHTML($body);
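
One thing to keep in mind: saveHTML($body) serializes the body element itself, so the output still starts with the body tag. A possible way to drop that wrapper, sketched here with a made-up `$html` string and a simple regular expression (an illustration, not part of the answer above):

```php
<?php
// Hypothetical standalone page used only for illustration.
$html = '<html><head><title>Demo</title></head><body id="main"><p>Hi</p></body></html>';

$d = new DOMDocument();
$d->loadHTML($html);
$body = $d->getElementsByTagName('body')->item(0);

// Serializing the <body> node includes the element itself.
$serialized = $d->saveHTML($body);

// Drop the opening and closing body tags, keeping only the inner markup.
$inner = preg_replace('/^\s*<body[^>]*>|<\/body>\s*$/i', '', $serialized);
echo $inner;
```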

Upvotes: 1

Patrick

Reputation: 169

$site = file_get_contents("http://www.google.com/");

preg_match("/<body[^>]*>(.*?)<\/body>/is", $site, $matches);

echo($matches[1]);
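
The snippet above reads $matches[1] without checking whether the pattern matched. A small sketch that adds that check is shown below; the function name `extractBodyWithRegex` and the sample `$page` are hypothetical. A regular expression can still be fooled by a literal "</body>" inside a script or comment, so treat this as a quick fix for markup you control.

```php
<?php
// Hypothetical helper: returns the markup between <body ...> and </body>,
// or the input unchanged when no body element is found.
function extractBodyWithRegex(string $html): string
{
    if (preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $matches) === 1) {
        return $matches[1];
    }
    return $html;
}

// Made-up page used only for illustration.
$page = "<!DOCTYPE html>\n<html><head><title>Demo</title></head>\n<body class=\"home\"><p>Hi</p></body></html>";
echo extractBodyWithRegex($page);
```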

Upvotes: 15

user5513910

Reputation:

This may be a solution. I tried it and it works fine.

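// Browser JavaScript (DOMParser), not PHP.
function parseHTML(string) {
    var parser = new DOMParser(),
        result = parser.parseFromString(string, "text/html");
    // Assuming no doctype: <html> -> <body> -> first child node of the body
    return result.firstChild.lastChild.firstChild;
}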

Upvotes: -1

lubosdz

Reputation: 4500

You may want to use the PHP tidy extension, which can repair invalid XHTML structures (on which loading into DOMDocument can fail) and also extract only the body:

$tidy = new tidy();
$htmlBody = $tidy->repairString($html, array(
    'output-xhtml' => true,
    'show-body-only' => true,
), 'utf8');

Then load extracted body into DOMDocument:

$xml = new DOMDocument();
$xml->loadHTML($htmlBody);

Then traverse, extract, or move nodes around as needed, and save:

$output = $xml->saveXML();

Upvotes: 2

tobyodavies

Reputation: 28099

Use a DOM parser. This is not tested, but it ought to do what you want:

$domDoc = new DOMDocument();
$domDoc->loadHTMLFile('/path/to/file');
$body = $domDoc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    echo $child->C14N(); // Note: this canonicalizes the representation of the node, but that's not necessarily a bad thing
}

If you want to avoid canonicalization, you can use this version (thanks to @Jared Farrish)

Upvotes: 1

Ja͢ck

Reputation: 173562

Use DOMDocument to keep what you need rather than strip what you don't need (PHP >= 5.3.6)

$d = new DOMDocument;
$d->loadHTMLFile($fileLocation);
$body = $d->getElementsByTagName('body')->item(0);
// perform innerhtml on $body by enumerating child nodes 
// and saving them individually
foreach ($body->childNodes as $childNode) {
  echo $d->saveHTML($childNode);
}
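
The same idea can be wrapped in a reusable function, sketched below. The name `bodyInnerHtml` is hypothetical, and this version takes the markup as a string instead of a file path. It also uses libxml_use_internal_errors() so that sloppy markup does not emit parser warnings, and it returns an empty string if no body element is found.

```php
<?php
// Hypothetical helper name; takes the markup as a string rather than a file path.
function bodyInnerHtml(string $html): string
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $d = new DOMDocument();
    $d->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $body = $d->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // no body element was produced
    }

    // Serialize each child of <body> individually, leaving out <body> itself.
    $out = '';
    foreach ($body->childNodes as $childNode) {
        $out .= $d->saveHTML($childNode);
    }
    return $out;
}

// Made-up page used only for illustration.
$result = bodyInnerHtml('<html><head><title>Demo</title></head><body><p>Hi</p><p>There</p></body></html>');
echo $result;
```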

Upvotes: 5

Jared Farrish

Reputation: 49198

Since the substr() method seemed to be too much for some to swallow, here is a DOM parser method:

$d = new DOMDocument;
$mock = new DOMDocument;
$d->loadHTML(file_get_contents('/path/to/my.html'));
$body = $d->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
    $mock->appendChild($mock->importNode($child, true));
}

echo $mock->saveHTML();

http://codepad.org/MQVQ3XQP

Anyone who wishes to see that "other one" can check the revisions.

Upvotes: 24
