Reputation: 3730
PHP Simple HTML DOM lets us take an HTML page and strip the markup, leaving only the text. Like this:
echo file_get_html('http://www.google.com/')->plaintext;
I am looking for the opposite of that method: remove all of the text and preserve only the tags. Does that exist? I can't seem to find any reference to it.
Upvotes: 1
Views: 641
Reputation: 19482
In the actual W3C DOM API (not SimpleHtmlDom), everything is a node, not only the element nodes. With XPath you can select the text nodes using the text() function and remove them:
$html = <<<'HTML'
<html><body>
<div>
TEXT<div>TEXT</div>
</div>
</body></html>
HTML;
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Remove every text node, leaving only the element structure
foreach ($xpath->evaluate('//text()') as $node) {
    $node->parentNode->removeChild($node);
}

echo $document->saveHTML($document->documentElement);
Output:
<html><body><div><div></div></div></body></html>
Upvotes: 1
Reputation: 3145
Don't use PHP's string search-and-replace functions or regular expressions for this. They're meant for manipulating flat strings, not nested HTML. Use something along the lines of an HTML DOM parser:
http://simplehtmldom.sourceforge.net/manual.htm
For example, to find all the img tags in an html document you'd do the following:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
If you go to the URL below, you'll learn how to find HTML elements within a loaded HTML page:
http://simplehtmldom.sourceforge.net/manual.htm#section_find
This is the most straightforward way of going about it: the parser has a built-in finder to locate HTML elements and shape them to your needs.
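For the original question (removing the text while keeping the tags), here is a minimal sketch using SimpleHtmlDom. It assumes the library's documented `text` selector, which matches plain-text nodes; blanking each node's `outertext` should drop the text and leave the surrounding markup intact. Untested against the library itself, so treat it as a starting point:

```php
// Requires the SimpleHtmlDom library file to be available.
include 'simple_html_dom.php';

$html = str_get_html('<div>TEXT<div>TEXT</div></div>');

// The 'text' selector matches plain-text nodes; setting their
// outertext to an empty string removes the text but keeps the tags.
foreach ($html->find('text') as $node) {
    $node->outertext = '';
}

echo $html;
```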
Upvotes: 0