Reputation: 3481
Using http://simplehtmldom.sourceforge.net/, I know that this extracts the text of an HTML page:
<?php
include('simple_html_dom.php');
// Create DOM from URL
echo file_get_html('http://www.google.com/')->plaintext;
?>
But how can I do the opposite and delete all the text?
For example, if I have this input HTML:
<html>
<head>
<title>Example</title>
</head>
<body>
<h1>Lore Ipsum</h1>
<p>
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.<br/>
Aenean <em>commodo</em> ligula eget dolor. Aenean massa.
</p>
</body>
</html>
I would like to get this output with SimpleHtmlDom:
<html>
<head>
<title></title>
</head>
<body>
<h1></h1>
<p><br/></p>
</body>
</html>
In other words, I want to keep the structure of the document only.
Please help.
Upvotes: 1
Views: 1392
Reputation: 316969
I don't know for sure how to do that with SimpleHtmlDom. From its manual, I'd assume something like
$html = file_get_html('http://www.google.com/');
// 'text' selects the text nodes; innertext is the property that can be written to
foreach ($html->find('text') as $text) {
    $text->innertext = '';
}
echo $html;
However, you can also use PHP's native DOM parser. It can do XPath queries and should in general be a good deal faster:
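// Collect parser warnings internally instead of printing them for sloppy markup
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.google.com');
$xp = new DOMXPath($dom);
// Select every text node in the document and detach it from its parent
foreach ($xp->query('//text()') as $textNode) {
    $textNode->parentNode->removeChild($textNode);
}
$dom->formatOutput = TRUE;
// Serialize starting at the root element, which leaves out the XML declaration
echo $dom->saveXML($dom->documentElement);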
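As a minimal sketch of the same DOM approach applied to markup like the sample in the question: the helper name stripTextNodes and the shortened input string are my own assumptions for illustration, not part of the question or of any library. Empty elements may come out in self-closing form (for example a title element without a separate closing tag), since saveXML() serializes as XML.

```php
<?php
// Hypothetical helper (name chosen for this sketch): removes every text node
// from an HTML string and returns the remaining markup.
function stripTextNodes(string $html): string
{
    // Collect parser warnings internally, then restore the previous setting
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Select every text node and detach it from its parent element
    foreach ($xpath->query('//text()') as $textNode) {
        $textNode->parentNode->removeChild($textNode);
    }

    // Serialize from the root element, leaving out the XML declaration
    return $dom->saveXML($dom->documentElement);
}

// Shortened version of the question's sample input (assumed for this sketch)
$input = '<html><head><title>Example</title></head><body>'
       . '<h1>Lore Ipsum</h1>'
       . '<p>Lorem ipsum dolor sit amet.<br/>Aenean <em>commodo</em> ligula.</p>'
       . '</body></html>';

echo stripTextNodes($input), PHP_EOL;
```

Note that the em element is kept as well (only its text is removed), so the result has slightly more structure than the desired output shown in the question.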
Upvotes: 3
Reputation: 5432
Set the innertext property of the HTML element to the empty string.
Using simplehtmldom.php:
$my_html = file_get_html('http://www.google.com/');
$my_html->innertext = "";
Upvotes: 1