Kevin

Reputation: 3730

Parse HTML and remove text, preserving tags

PHP Simple HTML DOM Parser lets us take an HTML page and extract only the text, dropping the markup. Like this:

echo file_get_html('http://www.google.com/')->plaintext;

I am looking for the opposite of that method. Remove all of the text and preserve only the tags. Does that exist? I can't seem to find any reference.

Upvotes: 1

Views: 641

Answers (2)

ThW

Reputation: 19482

In the actual W3C DOM API (not SimpleHtmlDom) everything is a node, not only the elements; text is stored in text nodes of its own. With XPath you can select those text nodes using the text() node test.

$html = <<<'HTML'
<html><body>
<div>
  TEXT<div>TEXT</div>
</div>
</body></html>
HTML;

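$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Select every text node in the document and detach it from its parent
foreach ($xpath->evaluate('//text()') as $node) {
  $node->parentNode->removeChild($node);
}

// Serialize only the root element, which skips the doctype that loadHTML() adds
echo $document->saveHTML($document->documentElement);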

Output:

<html><body><div><div></div></div></body></html>
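
If you need this in more than one place, the same idea could be wrapped in a small function. This is only a sketch: the name stripTextNodes and the sample input are made up for illustration, and the libxml_use_internal_errors() calls are an addition that assumes you want to keep parser warnings about sloppy real-world HTML out of the output.

```php
<?php
// Hypothetical helper (not part of any library): returns the markup of an
// HTML string with every text node removed.
function stripTextNodes(string $html): string
{
    $document = new DOMDocument();

    // Collect libxml parser warnings internally instead of printing them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    // Copy the matches into an array first, a defensive step so that
    // removing nodes cannot affect the iteration.
    foreach (iterator_to_array($xpath->query('//text()')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize only the root element, leaving out the doctype.
    return $document->saveHTML($document->documentElement);
}

// Sample input invented for this sketch.
echo stripTextNodes('<html><body><p>Hello <b>world</b></p></body></html>');
```

Note that //text() matches text nodes only, so comments stay in the result; selecting '//text() | //comment()' instead should remove both.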

Upvotes: 1

unixmiah

Reputation: 3145

Don't use any search-and-replace PHP function or regexp for this. They're meant to parse and manipulate plain strings and larger texts, not nested markup. Use something along the lines of HTML DOM parsing.

http://simplehtmldom.sourceforge.net/manual.htm

For example, to find all the img and a tags in an HTML document, you'd do the following:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}

The section of the manual linked below explains how to find HTML elements within a loaded HTML page:

http://simplehtmldom.sourceforge.net/manual.htm#section_find

I find this the most efficient way of going about it. The library has a built-in finder to locate HTML elements and shape them to your needs.
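
For comparison, here is a rough sketch of the same two lookups using PHP's bundled DOM extension, in case adding Simple HTML DOM is not an option. The sample markup is invented for illustration, and it is parsed from a string rather than fetched from a URL.

```php
<?php
// Sample markup invented for this sketch.
$html = '<html><body><a href="/about">About</a><img src="logo.png"></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);

// Find all images
foreach ($document->getElementsByTagName('img') as $element) {
    echo $element->getAttribute('src'), "\n";
}

// Find all links
foreach ($document->getElementsByTagName('a') as $element) {
    echo $element->getAttribute('href'), "\n";
}
```

Either way the point is the same: the markup is walked as a tree of elements rather than matched as a string.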

Upvotes: 0
