Parse HTML and remove text, preserving tags

Question

PHP Simple HTML DOM lets us take an HTML page and extract only the text, dropping the markup. Like this:

echo file_get_html('http://www.google.com/')->plaintext;

I am looking for the opposite of that: remove all of the text and preserve only the tags. Does that exist? I can't find any reference to it.

ThW · Accepted Answer

In the actual W3C DOM API (not SimpleHtmlDom), everything is a node, not only the elements; text is stored in text nodes. With XPath you can select those using the text() node test.

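// Illustrative sample markup; any HTML containing text nodes works.
$html = <<<'HTML'
<html>
<body>
  <p>TEXT<b>TEXT</b></p>
</body>
</html>
HTML;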

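$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// The XPath result is a snapshot, so removing nodes inside the loop is safe.
foreach ($xpath->evaluate('//text()') as $node) {
  $node->parentNode->removeChild($node);
}

// Serialize only the root element, which omits the doctype.
echo $document->saveHTML($document->documentElement);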

Output: the same markup with every text node removed, so only the tags remain.
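
The approach above could be wrapped in a reusable function along these lines. This is a minimal sketch, assuming the DOM extension is available; stripTextNodes() and the sample input are hypothetical names chosen for illustration, and the libxml error handling is an addition that the answer itself does not use.

```php
<?php

/**
 * Return the markup of $html with every text node removed.
 * stripTextNodes() is a hypothetical helper name, not part of any library.
 */
function stripTextNodes(string $html): string
{
    $document = new DOMDocument();

    // Real-world HTML is often invalid; collect parser warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    // Select every text node in the document and detach it from its parent.
    foreach ($xpath->evaluate('//text()') as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize from the root element so no doctype is emitted.
    return $document->saveHTML($document->documentElement);
}

// Assumed sample input: a fragment, which loadHTML() wraps in <html><body>.
echo stripTextNodes('<div>Hello <b>world</b><br>again</div>'), "\n";
```

The serializer may insert line breaks between the remaining elements, so compare results structurally rather than as exact strings.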
